Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R....

41
Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001

Transcript of Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R....

Page 1: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture

Professor Alvin R. Lebeck

Computer Science 220

Fall 2001

Page 2: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 2© Alvin R. Lebeck 2001

Administrative

• Some textbooks here, more to arrive Friday afternoon

• Read Chapter 3

• Homework #1 Due September 11– Simple scalar, read some of the documentation first

– See web page for details

– Questions, contact Fareed ([email protected])

• Policy on Academic Integrity (Cheating..)

• Homework– Discussion of topics is encouraged, peers are great resource

– But, hand in your work

• Projects– Work in pairs, learn how to collaborate.

Page 3: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 3© Alvin R. Lebeck 2001

Review

• Designing to Last through TrendsCapacity Speed

Logic 2x in 3 years 2x in 3 years

DRAM 4x in 3 years 1.4x in 10 years

Disk 4x in 3 years 1.4x in 10 years

• Time to run the task– Execution time, response time, latency

• Tasks per day, hour, week, sec, ns, …– Throughput, bandwidth

• “X is n times faster than Y” means ExTime(Y) Performance(X)

--------- = --------------

ExTime(X) Performance(Y)

Page 4: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 4© Alvin R. Lebeck 2001

The Danger of Extrapolation

• Dot-com stock value

• Technology Trends

• Power dissipation?• Cost of new fabs?• Alternative

technologies?–GaAs–Optical

Page 5: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 5© Alvin R. Lebeck 2001

Amdahl’s Law

ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced

Speedupoverall =ExTimeold

ExTimenew

Speedupenhanced

=

1

(1 - Fractionenhanced) + Fractionenhanced

Speedupenhanced

Page 6: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

6© Alvin R. Lebeck 2001

Review: Performance

Countn Instructio

I F e wherFCPI CPI

I CPI Time Cycle timeCPU

Countn Instructio

Cycles

Countn Instructio

RateClock timeCPU CPI

ii

n

1 iii

i

n

1 ii

Invest Resources where time is Spent!

“Average Cycles Per Instruction”

“Instruction Frequency”

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

Page 7: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 7© Alvin R. Lebeck 2001

Example

Base Machine (Reg / Reg)Op Freq CyclesALU 50% 1Load 20% 2Store 10% 2Branch 20% 2

Typical Mix

Add register / memory operations:– One source operand in memory– One source operand in register– Cycle count of 2

Branch cycle count to increase to 3.

What fraction of the loads must be eliminated for this to pay off?

Page 8: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 8© Alvin R. Lebeck 2001

Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op Freq Cycles CPI Freq Cycles CPI

ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

CPINew must be normalized to new instruction frequency

CyclesNew

InstructionsNew

Page 9: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 9© Alvin R. Lebeck 2001

Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op Freq Cycles Freq Cycles

ALU .50 1 .5 .5 – X 1 .5 – X

Load .20 2 .4 .2 – X 2 .4 – 2X

Store .10 2 .2 .1 2 .2

Branch .20 2 .3 .2 3 .6

Reg/Mem X 2 2X

1.00 1.5 1 – X (1.7 – X)/(1 – X)

Instr CntOld x CPIOld x ClockOld = Instr CntNew x CPINew x ClockNew

1.00 x 1.5 = (1 – X) x (1.7 – X)/(1 – X)

1.5 = 1.7 – X

0.2 = X

ALL loads must be eliminated for this to be a win!

Page 10: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 10© Alvin R. Lebeck 2001

Programs to Evaluate Processor Performance

• (Toy) Benchmarks– 10-100 line program

– e.g.: sieve, puzzle, quicksort

• Synthetic Benchmarks– Attempt to match average frequencies of real workloads

– e.g., Whetstone, dhrystone

• Kernels– Time critical excerpts of real programs

– e.g., Livermore loops

• Real programs– e.g., gcc, compress, database, graphics, etc.

Page 11: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 11© Alvin R. Lebeck 2001

Benchmarking Games

• Differing configurations used to run the same workload on two systems

• Compiler wired to optimize the workload

• Test specification written to be biased towards one machine

• Workload arbitrarily picked

• Very small benchmarks used

• Benchmarks manually translated to optimize performance

Page 12: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 12© Alvin R. Lebeck 2001

Common Benchmarking Mistakes

• Not validating measurements• Collecting too much data but doing too little analysis• Only average behavior represented in test workload• Loading level (other users) controlled inappropriately• Caching effects ignored• Buffer sizes not appropriate• Inaccuracies due to sampling ignored• Ignoring monitoring overhead• Not ensuring same initial conditions• Not measuring transient (cold start) performance• Using device utilizations for performance

comparisons

Page 13: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 13© Alvin R. Lebeck 2001

SPEC: System Performance Evaluation Cooperative

• First Round 1989– 10 programs yielding a single number

• Second Round 1992– SpecInt92 (6 integer programs) and SpecFP92 (14 floating point

programs)» Compiler Flags unlimited. March 93 of DEC 4000 Model 610:

– spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”

– wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200– nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

• Third Round 1995– Single flag setting for all programs; new set of programs

“benchmarks useful for 3 years”

• SPEC2000: two options 1) specific flags 2) whatever you want

Page 14: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 14© Alvin R. Lebeck 2001

SPEC First Round

• One program: 99% of time in single line of code

• New front-end compiler could improve dramatically

Benchmark

SP

EC

Pe

rf

0

100

200

300

400

500

600

700

800gcc

epre

sso

spic

e

doduc

nasa

7 li

eqnto

tt

matr

ix300

fpppp

tom

catv

Page 15: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 15© Alvin R. Lebeck 2001

How to Summarize Performance

• Arithmetic mean (weighted arithmetic mean) tracks execution time: (Ti)/n or (Wi*Ti)

• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/ (1/Ri) or 1/ (Wi/Ri)

• Normalized execution time is handy for scaling performance

• But do not take the arithmetic mean of normalized execution time, use the geometric mean ( Ri

1/n)

Page 16: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

16© Alvin R. Lebeck 2001

Reporting Results

• Reproducibility

• List everything another researcher needs to duplicate the results

• May include archiving your simulation/software infrastructure

• Processor, cache hierarchy, main memory, disks, compiler version and optimization flags, OS version, application inputs, etc.

Page 17: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 17© Alvin R. Lebeck 2001

Performance Evaluation

• Given sales is a function of performance relative to the competition, big investment in improving product as reported by performance summary

• Good products created when you have:– Good benchmarks

– Good ways to summarize performance

• If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales;Sales almost always wins!

• Ex. time or bandwidth is the measure of computer performance!

• What about cost?

Page 18: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

18© Alvin R. Lebeck 2001

Integrated Circuit Costs

dies TestArea Die2

diameter wafer

Area Die2

diameter wafer

waferper Dies

Yield Die Waferper Dies

Cost WaferDieCost

Yield Test Final

Cost Packaging Cost Testing Cost Die Cost IC

2

area Die area unit per Defects

1 yield Wafer Yield Die

Die Cost goes roughly with die area4

Page 19: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 19© Alvin R. Lebeck 2001

Real World Examples

Chip Metal Line Wafer Defect Area Dies/ Yield Die Cost layers width cost /cm2 mm2 wafer

386DX 2 0.90 $900 1.0 43 360 71% $4

486DX2 3 0.80 $1200 1.0 81 181 54% $12

PowerPC 601 4 0.80 $1700 1.3 121 115 28% $53

HP PA 7100 3 0.80 $1300 1.0 196 66 27% $73

DEC Alpha 3 0.70 $1500 1.2 234 53 19% $149

SuperSPARC 3 0.70 $1700 1.6 256 48 13% $272

Pentium 3 0.80 $1500 1.5 296 40 9% $417

– From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

Page 20: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 20© Alvin R. Lebeck 2001

Cost/PerformanceWhat is Relationship of Cost to Price?

• Component Costs• Direct Costs (add 25% to 40%) recurring costs: labor,

purchasing, scrap, warranty

• Gross Margin (add 82% to 186%) nonrecurring costs: R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes

• Average Discount to get List Price (add 33% to 66%): volume discounts and/or retailer markup

ComponentCost

Direct Cost

GrossMargin

AverageDiscount

Avg. Selling Price

List Price

15% to 33%6% to 8%

34% to 39%

25% to 40%

Page 21: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

Instruction Set Architecture

Page 22: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 22© Alvin R. Lebeck 2001

Instruction Set Architecture

• 1950s to 1960s: Computer Architecture Course Computer Arithmetic

• 1970 to mid 1980s: Computer Architecture Course Instruction Set Design, especially ISA appropriate for compilers

• 1990s: Computer Architecture CourseDesign of CPU, memory system, I/O system, Multiprocessors

• 2000s: Computer Architecture Course– Power issues

– Wire delays (distributed microarchitecture)

– New technologies

– New applications (media, network, etc.)

Page 23: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 23© Alvin R. Lebeck 2001

Computer Architecture?

. . . the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls the logic design, and the physical implementation.

Amdahl, Blaaw, and Brooks, 1964

SOFTWARESOFTWARE

Page 24: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 24© Alvin R. Lebeck 2001

Towards Evaluation of ISA and Organization

instruction set

software

hardware

Page 25: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 25© Alvin R. Lebeck 2001

Interface Design

A good interface:

• Lasts through many implementations (portability, compatability)

• Is used in many differeny ways (generality)

• Provides convenient functionality to higher levels

• Permits an efficient implementation at lower levels

Interfaceimp 1

imp 2

imp 3

use

use

use

time

Page 26: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 26© Alvin R. Lebeck 2001

Evolution of Instruction Sets

Single Accumulator (EDSAC 1950)

Accumulator + Index Registers(Manchester Mark I, IBM 700 series 1953)

Separation of Programming Model from Implementation

High-level Language Based Concept of a Family(B5000 1963) (IBM 360 1964)

General Purpose Register Machines

Complex Instruction Sets Load/Store Architecture

RISC

(Vax, Intel 432 1977-80) (CDC 6600, Cray 1 1963-76)

(Mips,Sparc,88000,IBM RS6000, . . .1987)

Page 27: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 27© Alvin R. Lebeck 2001

Evolution of Instruction Sets

• Major advances in computer architecture were typically associated with landmark instruction set designs– Ex: Stack vs GPR (System 360)

• Design decisions must take into account:– technology

– machine organization

– programming langauges

– compiler technology

– operating systems

• And they in turn influence these

Page 28: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 28© Alvin R. Lebeck 2001

Design Space of ISA

Five Primary Dimensions• Number of explicit operands ( 0, 1, 2, 3 )

• Operand Storage Where besides memory?

• Effective Address How is memory location specified?

• Type & Size of Operands byte, int, float, vector, . . .

How is it specified?

• Operations add, sub, mul, . . .

How is it specifed?

Other Aspects• Successor How is it specified?

• Conditions How are they determined?

• Encodings Fixed or variable? Wide?

• Parallelism

Page 29: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 29© Alvin R. Lebeck 2001

ISA Metrics

• Aesthetics:• Regularity (Orthogonality)

– No special registers, few special cases, all operand modes available with any data type or instruction type

• Primitives not solutions• Completeness

– Support for a wide range of operations and target applications

• Streamlined– Resource needs easily determined

• Ease of compilation (programming?)• Ease of implementation• Scalability• Density (Network BW and Power Consumption)

Page 30: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 30© Alvin R. Lebeck 2001

Basic ISA Classes

Accumulator:1 address add A acc acc + mem[A]

1+x address addx A acc acc + mem[A + x]

Stack:0 address add tos tos + next (JAVA VM)

General Purpose Register:2 address add A B A A + B

3 address add A B C A B + C

Load/Store:3 address add Ra Rb Rc Ra Rb + Rc

load Ra Rb Ra mem[Rb]

store Ra Rb mem[Rb] Ra

Page 31: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 31© Alvin R. Lebeck 2001

Stack Machines

• Instruction set: +, -, *, /, . . .push A, pop A

• Example: a*b - (a+c*b)push apush b*push apush cpush b*+-

A B

AA*B

-

+

aa b

*

b

*

c

A*BA*B

A*B

A

AC

A*B

A A*BAC B C*B + * - + *

Page 32: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 32© Alvin R. Lebeck 2001

The Case Against Stacks

• Performance is derived from the existence of several fast registers, not from the way they are organized

• Data does not always “surface” when needed– Constants, repeated operands, common subexpressions

• so TOP and Swap instructions are required

• Code density is about equal to that of GPR instruction sets– Registers have short addresses

– Keep things in registers and reuse them

• Slightly simpler to write a poor compiler, but not an optimizing compiler

• So, why JAVA?

Page 33: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 33© Alvin R. Lebeck 2001

Variable format, 2 and 3 address instruction

• 32-bit word size, 16 GPR (four reserved)

• Rich set of addressing modes (apply to any operand)

• Rich set of operations

– bit field, stack, call, case, loop, string, poly, system

• Rich set of data types (B, W, L, Q, O, F, D, G, H)

• Condition codes

VAX-11

OpCode A/M A/M A/M

Byte 0 1 n m

Page 34: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 34© Alvin R. Lebeck 2001

Kinds of Addressing Modes

• Register direct Ri

• Immediate (literal) v

• Direct (absolute) M[v]

• Register indirect M[Ri]

• Base+Displacement M[Ri + v]

• Base+Index M[Ri + Rj]

• Scaled Index M[Ri + Rj*d + v]

• Autoincrement M[Ri++]

• Autodecrement M[Ri--]

• Memory Indirect M[M[Ri]]

Ri Rj vmemory

reg. file

Page 35: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 35© Alvin R. Lebeck 2001

A "Typical" RISC

• 32-bit fixed format instruction (3 formats)

• 32 64-bit GPR (R0 contains zero)

• 3-address, reg-reg arithmetic instruction

• Single address mode for load/store: base + displacement– no indirection

• Simple branch conditions

• Delayed branch

see: SPARC, MIPS, MC88100, AMD2900, i960, i860 PARisc, POWERPC, DEC Alpha, Clipper, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Page 36: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 36© Alvin R. Lebeck 2001

Example: MIPS (like DLX)

Op

31 26 01516202125

Rs1 Rd immediate

Op

31 26 025

Op

31 26 01516202125

Rs1 Rs2

target

Rd Opx

Register-Register

561011

Register-Immediate

Op

31 26 01516202125

Rs1 Rs2/Opx immediate

Branch

Jump / Call

Page 37: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 37© Alvin R. Lebeck 2001

Next Time

• Data path design

• Pipelining

• Homework #1 Due Sept 11

Page 38: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 38© Alvin R. Lebeck 2001

Review: Execution Time and Amdahl’s Law

ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced

Speedupoverall =ExTimeold

ExTimenew

Speedupenhanced

=

1

(1 - Fractionenhanced) + Fractionenhanced

Speedupenhanced

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

Page 39: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 39© Alvin R. Lebeck 2001

Review: How to Summarize Performance

• Arithmetic mean (weighted arithmetic mean) tracks execution time: (Ti)/n or (Wi*Ti)

• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/ (1/Ri) or 1/ (Wi/Ri)

• Normalized execution time is handy for scaling performance

• But do not take the arithmetic mean of normalized execution time, use the geometric mean ( (Ri)^1/n)

Page 40: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

CPS 220 40© Alvin R. Lebeck 2001

Review: Performance Evaluation

• Benchmarks (toy,synthetic, kernels, full applications)

• Games

• Mistakes

• Influence of making the sale

Page 41: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

41© Alvin R. Lebeck 2001

Review: Integrated Circuit Costs

dies TestArea Die2

diameter wafer

Area Die2

diameter wafer

waferper Dies

Yield Die Waferper Dies

Cost WaferDieCost

Yield Test Final

Cost Packaging Cost Testing Cost Die Cost IC

2

area Die area unit per Defects

1 yield Wafer Yield Die

Die Cost goes roughly with die area4