1/16/99 CS520S99 Introduction

C. Edward Chow

Why study computer architecture?

• To learn the principles for designing processors and systems
• To learn the system configuration trade-offs
– what size of caches/memory is enough
– what kind of buses to connect system components
– what size (speed) of disks to use
• To choose a computer for a set of applications in a project
• To interpret the benchmark figures given by salespersons
• To decide which processor chips to use in a system
• To design the system software (compiler, OS) for a new processor?
• To be the leader of a processor design team?
• To learn several machines’ assembly languages?



The Basic Structure of a Computer


Control and Data Flow in Processor

Processor is made up of

• Data operator (Arithmetic and Logic Unit, ALU), denoted D: consumes and combines information into a new meaning
• Control, denoted K: evokes operations of other components



Control is often distributed


Instruction Execution at Register Transfer Level (RTL)

• Consider the detailed execution of the instruction “move &100, %d0” (moving the constant 100 to register d0).

• Assume the instruction was loaded into memory location 1000.

• The op code of the move instruction and the register address d0 are encoded in bytes 1000 and 1001.

• The constant 100 is stored in bytes 1002 and 1003.


RTL Instruction Execution

• Mpc is set to 1000, pointing at the instruction in memory.
• Step 1: Mmar = Mpc; // put pc into mar; prepare to fetch instruction.


Update Program Counter

• Step 2: Mpc = Mpc+4; // update program counter
  Move the Mpc value to D, D performs +4, and the result moves back to Mpc.


Instruction Fetch

• Step 3: Mir = Mp[Mmar]; // fetch instruction
  Send the Mmar value to Mp; Mp retrieves move|d0 and sends it back to Mir.

Steps 3 and 2 can be done in parallel.

[Figure: Mmar = 1000; Mir = Move|d0|100]


Instruction Decoding

• Step 4: Decode the instruction in Mir

[Figure: Mir = Move|d0|100]


RTL Instruction Execution

• Step 5: Mgeneral[0] = Mp[Mir16-31]; // execute the move of the constant into a general register named d0

The subscript 16-31 denotes bits 16 through 31 of Mir, which contain the constant 100.

[Figure: Mir = Move|d0|100; d0 = 100]
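The five steps can be traced with a small simulation. This is only a sketch under stated assumptions, not code from the course: the 16-bit encoding 0x1000 for "move to d0" is hypothetical, memory is assumed to be byte-addressed and big-endian, and bits 16-31 of Mir are treated as the immediate constant itself.

```python
# Minimal register-transfer trace of "move &100, %d0" (illustrative only).
MOVE_D0 = 0x1000  # hypothetical 16-bit encoding of opcode "move" + register d0

# Byte-addressed primary memory Mp; the instruction occupies bytes 1000-1003.
Mp = {
    1000: MOVE_D0 >> 8, 1001: MOVE_D0 & 0xFF,  # opcode and register field
    1002: 0, 1003: 100,                        # 16-bit constant 100
}

def fetch_word32(memory, address):
    """Read four consecutive bytes as one big-endian 32-bit word."""
    word = 0
    for offset in range(4):
        word = (word << 8) | memory[address + offset]
    return word

Mpc = 1000                      # program counter points at the instruction
Mgeneral = [0] * 8              # general registers d0..d7

Mmar = Mpc                      # Step 1: put pc into mar
Mpc = Mpc + 4                   # Step 2: update program counter
Mir = fetch_word32(Mp, Mmar)    # Step 3: fetch instruction into ir
opcode_and_reg = Mir >> 16      # Step 4: decode bits 0-15 (upper half)
constant = Mir & 0xFFFF         # bits 16-31 (lower half) hold the constant
if opcode_and_reg == MOVE_D0:   # Step 5: execute the move into d0
    Mgeneral[0] = constant

print(Mpc, Mgeneral[0])
```

Steps 2 and 3 read different registers (Mpc and Mmar), which is why they can overlap in hardware.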


Computer Architecture

The term “computer architecture” was coined by IBM in 1964 for use with the IBM 360. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the instruction set. They believed that a family of machines of the same architecture should be able to run the same software.

Benefits:

• With a precisely defined architecture, we can have many compatible implementations.
• A program written in the same instruction set can run on all the compatible implementations.


Architecture & Implementation

• Single architecture, multiple implementations: computer family

• Multiple architectures, single implementation: microcode emulator


Computer Architecture Topics

[Figure: topic map of a single-processor system. Labels recovered from the figure:]

• Instruction Set Architecture
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP
• Memory Hierarchy: L1 Cache, L2 Cache, DRAM
• Input/Output and Storage: Disks, WORM, Tape
• Addressing, Protection, Exception Handling
• Coherence, Bandwidth, Latency
• Emerging Technologies, Interleaving, Bus protocols
• RAID
• VLSI

Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB


Computer Architecture Topics

[Figure: multiprocessor organization. Processor (P) and memory (M) pairs connected through an interconnection network (S). Labels recovered from the figure:]

• Multiprocessors
• Networks and Interconnections
• Processor-Memory-Switch
• Shared Memory, Message Passing, Data Parallelism
• Network Interfaces
• Topologies, Routing, Bandwidth, Latency, Reliability



CS 520 Course Focus

Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.

[Figure: Computer Architecture (Instruction Set Design, Organization, Hardware) surrounded by the factors that shape it: Technology, Programming Languages, Operating Systems, History, Applications, Interface Design (ISA), Measurement & Evaluation, Parallelism]



Function Requirements faced by a computer designer

• Applications
– General purpose: balanced performance for a range of tasks
– Scientific: high-performance floating point
– Commercial: support for COBOL (decimal arithmetic), database/transaction processing

• Level of software compatibility
– Object code/binary level: no software porting, more hardware design cost
– Programming language level: avoids the burden of an old architecture, requires software porting


Function Requirements faced by a computer designer

• Operating System Requirements
– Size of address space
– Memory management/Protection (e.g. garbage collection vs. real-time scheduling)
– Interrupts/traps

• Standards
– Floating Point (IEEE754)
– I/O Bus
– OS
– Networks
– Programming Languages


1988 Computer Food Chain

[Figure: food chain with the labels PC, Workstation, Minicomputer, Mainframe, Mini-supercomputer, Supercomputer, Massively Parallel Processors]



1998 Computer Food Chain

[Figure: food chain with the labels PC, Workstation, Server, Mainframe, Supercomputer, Mini-supercomputer, Massively Parallel Processors, Minicomputer]

Now who is eating whom?



Why Such Change in 10 years?

• Performance
– Technology advances
  • CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
– Computer architecture advances improve the low end
  • RISC, superscalar, RAID, …
• Price: lower costs due to …
– Simpler development
  • CMOS VLSI: smaller systems, fewer components
– Higher volumes
  • CMOS VLSI: same development cost for 10,000 vs. 10,000,000 units
– Lower margins by class of computer, due to fewer services
• Function
– Rise of networking/local interconnection technology



Technology Trends: Microprocessor Capacity

[Figure: Moore’s Law. Transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2000), with points for the i4004, i8080, i8086, i80286, i80386, i80486, and Pentium, and a “Graduation Window” annotation]

CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs

Transistor counts:
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million



Memory Capacity (Single Chip DRAM)

[Figure: DRAM size in bits per chip (log scale, 1,000 to 1,000,000,000) versus year (1970 to 2000)]

year    size (Mb)    cycle time
1980    0.0625       250 ns
1983    0.25         220 ns
1986    1            190 ns
1989    4            165 ns
1992    16           145 ns
1996    64           120 ns
2000    256          100 ns



Technology Trends (Summary)

         Capacity          Speed (latency)
Logic    2x in 3 years     2x in 3 years
DRAM     4x in 3 years     2x in 10 years
Disk     4x in 3 years     2x in 10 years



Processor Performance Trends

[Figure: relative performance (log scale, 0.1 to 1000) versus year (1965 to 2000) for Microprocessors, Minicomputers, Mainframes, and Supercomputers]



Processor Performance (1.35X before, 1.55X now)

[Figure: performance (0 to 1200) versus year (1987 to 1997), with a 1.54X/yr trend line. Machines plotted: MIPS M/120, MIPS M2000, Sun-4/260, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600]



Performance Trends (Summary)

• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)

• Improvement in cost performance estimated at 70% per year



Computer Engineering Methodology

[Figure: design cycle, built up over four slides]

• Evaluate Existing Systems for Bottlenecks (input: Benchmarks)
• Simulate New Designs and Organizations (input: Workloads)
• Implement Next Generation System (input: Implementation Complexity)
• Technology Trends feed the whole cycle

Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB



Measurement and Evaluation

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: iteration loop of Creativity, Design, Analysis, and Cost/Performance Analysis, which separates Good Ideas from Mediocre Ideas and Bad Ideas]



Measurement Tools

• Benchmarks, Traces, Mixes
• Hardware: cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles


Metric of Computer Architecture

• Space measured in bits of representation
• Time measured in bit traffic (memory bandwidth)

Many old frequency and benchmark studies focus on
• dynamic opcode (memory size concern)
• exponent differences of floating point operands (precision)
• length of decimal numbers in business files (memory size)

Trend: space is not much of a concern; speed/time is everything.

Here we focus more on the following two performance metrics:
• Response time = time between start and finish of an event
– execution time
– latency
• Throughput = total amount of work done in a given time
– bandwidth (no. of bits or bytes moved per second)


Metrics of Performance at Different Levels

[Figure: system layers, top to bottom: Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors/Wires/Pins. Metrics shown alongside the layers:]

• Answers per month
• Operations per second
• (Millions) of instructions per second: MIPS
• (Millions) of (FP) operations per second: MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)



Quantitative principles

Improve means
• increase performance
• decrease execution time

“X is n% faster than Y” means

$$\frac{\mathrm{ExecutionTime}_Y}{\mathrm{ExecutionTime}_X} = 1 + \frac{n}{100}$$

Quantitative principles
• Make the common case fast
– Amdahl’s Law
• Locality of reference
– 90% of execution time in 10% of code


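Amdahl’s Law

Law of diminishing returns

$$\mathrm{Speedup} = \frac{\mathrm{ExecutionTimeWithoutEnhancement}}{\mathrm{ExecutionTimeWithEnhancement}}$$

$$\mathrm{Time}_{new} = \mathrm{Time}_{old} \times \left[ (1 - \mathrm{FractionInEnhancedMode}) + \frac{\mathrm{FractionInEnhancedMode}}{\mathrm{SpeedupOfEnhancedMode}} \right]$$

$$\mathrm{Speedup} = \frac{\mathrm{Time}_{old}}{\mathrm{Time}_{new}} = \frac{1}{(1 - \mathrm{FractionInEnhancedMode}) + \frac{\mathrm{FractionInEnhancedMode}}{\mathrm{SpeedupOfEnhancedMode}}}$$

Example: FractionInEnhancedMode = 0.5 (based on the old system), SpeedupOfEnhancedMode = 2.

[Figure: Time_old = 50 + 50; Time_new = 50 + 25]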
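The 50 + 50 example can be checked numerically. A minimal sketch follows; the function names are illustrative and not part of the course material.

```python
def time_with_enhancement(time_old, fraction_enhanced, speedup_enhanced):
    """New execution time when only part of the old time is sped up."""
    unaffected = time_old * (1 - fraction_enhanced)
    enhanced = time_old * fraction_enhanced / speedup_enhanced
    return unaffected + enhanced

def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: Time_old / Time_new, independent of Time_old."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

time_old = 100                                      # 50 unaffected + 50 enhanceable
time_new = time_with_enhancement(time_old, 0.5, 2)  # 50 + 25
print(time_new, round(overall_speedup(0.5, 2), 2))
```

Half of the time is untouched, so doubling the speed of the other half yields an overall speedup of only about 1.33.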


Amdahl’s Law Result

FractionInEnhancedMode    OverallSpeedup when             OverallSpeedup when
                          SpeedupOfEnhancedMode=2         SpeedupOfEnhancedMode=∞
0.1                       1.05                            1.11
0.3                       1.18                            1.43
0.5                       1.33                            2
0.7                       1.54                            3.33
0.9                       1.82                            10
0.99                      1.98                            100


Apply Amdahl’s Law: Example 1

Example 1: Assume that memory access accounts for 90% of the execution time. What is the speedup from replacing a 100 ns memory with a 10 ns memory? How much faster is the new system?

Answer:
FractionInEnhancedMode = 90% = 0.9
SpeedupOfEnhancedMode = 100 ns/10 ns = 10

$$\mathrm{OverallSpeedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.1 + 0.09} = \frac{1}{0.19} = 5.26 = 1 + \frac{426}{100}$$

The new system is 426% faster than the old one. Is it worthwhile if the high-speed memory costs 10 times more?


Apply Amdahl’s Law: Example 2

Example 2: Assume that 40% of the time is spent on CPU tasks; the rest is spent on I/O. Assume we improve the CPU and keep the I/O speed unchanged.

a) How much faster should the new CPU be to have an overall speedup of 1.5?

b) Is it possible to have an overall speedup of 2? Why?

Solution:

a) Let x be the speedup of the CPU:

$$1.5 = \frac{1}{(1 - 0.4) + \frac{0.4}{x}}$$

x = 6, so the new CPU must be 500% faster.

b) The maximum overall speedup that can be achieved is

$$\frac{1}{1 - 0.4} = 1.67$$

Therefore, it is not possible to achieve an overall speedup of 2.
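Both parts of Example 2 amount to inverting Amdahl's Law. A sketch, with a hypothetical helper name chosen for illustration:

```python
def required_enhancement(fraction_enhanced, target_speedup):
    """Speedup the enhanced part needs to reach a target overall speedup.

    Returns None when the target is at or beyond the Amdahl limit
    1 / (1 - fraction_enhanced), which no finite enhancement can reach.
    """
    max_speedup = 1 / (1 - fraction_enhanced)
    if target_speedup >= max_speedup:
        return None
    # Solve target = 1 / ((1 - f) + f / x) for x.
    return fraction_enhanced / (1 / target_speedup - (1 - fraction_enhanced))

print(required_enhancement(0.4, 1.5))  # part a: about 6
print(required_enhancement(0.4, 2.0))  # part b: None, limit is 1/0.6
```

The limit in part b comes from the 60% of time spent on I/O, which no CPU improvement touches.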


Apply Amdahl’s Law: Example 3

Example: A recent study of the bottleneck in a 10 Mbps Ethernet network system showed that only 10% of the execution time of a distributed application was spent on transmitting messages and 90% of the time was spent on application/protocol software execution at the host computers. If we replace Ethernet with 100 Mbps FDDI, 900% faster than Ethernet, what will be the speedup of this improvement? What if we use 900% faster hosts?
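One way to work this example is to convert "900% faster" into a 10x speedup of the affected part and apply Amdahl's Law to each case. The sketch below assumes the 10%/90% split applies unchanged in both scenarios; the helper names are illustrative.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law for one enhanced fraction of the execution time."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def speed_ratio(percent_faster):
    """'n% faster' means the old/new time ratio is 1 + n/100."""
    return 1 + percent_faster / 100

faster_network = overall_speedup(0.10, speed_ratio(900))  # FDDI replaces Ethernet
faster_hosts = overall_speedup(0.90, speed_ratio(900))    # hosts replaced instead
print(f"network: {faster_network:.2f}x, hosts: {faster_hosts:.2f}x")
```

Under these assumptions the faster network gives roughly a 10% gain, while faster hosts give more than a fivefold gain, because the hosts account for the common case.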


Execution Time

The first performance metric and the best metric.

Measure the time it takes to execute the intended application(s) or the typical workload. The time command can measure an application.

    vlsia[93]: time ts9
    217.1u 27.2s 8:16 49% 0+27552k 6+3io 26pf+0w

Here is an example which shows how OS and I/O impact the execution time.

For program 1,
Elapsed Time = sum(t1:t11)-t6-t8
System CPU time = t1+t3+t5+t9+t11
CPU time = t1 + t3 + t4 + t5 + t9+t10
User CPU time = t4 + t10
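The same three quantities (user CPU time, system CPU time, elapsed time) can be collected from inside a program. A minimal sketch using the Python standard library; `busy_loop` is a hypothetical stand-in for a real application.

```python
import os
import time

def measure(work):
    """Run work() and return (user, system, elapsed) seconds for this process."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    work()
    cpu_end = os.times()
    elapsed = time.perf_counter() - wall_start
    user = cpu_end.user - cpu_start.user
    system = cpu_end.system - cpu_start.system
    return user, system, elapsed

def busy_loop():
    # Hypothetical CPU-bound workload standing in for a real application.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

user, system, elapsed = measure(busy_loop)
# Share of elapsed time spent on the CPU, like the percentage the time command reports.
cpu_share = 100 * (user + system) / elapsed if elapsed > 0 else 0.0
print(f"{user:.3f}u {system:.3f}s {elapsed:.3f} elapsed {cpu_share:.0f}%")
```

A CPU share well below 100% indicates that the program spent much of its elapsed time waiting on I/O or other processes.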


CPU Time

$$\mathrm{CPU\ time} = IC \times CPI \times \mathrm{ClockCycleTime}, \qquad CPI = \frac{\sum_i CPI_i \times I_i}{IC}$$

CPI = clock cycles per instruction; Ii is the frequency of instruction i in a program; IC = instruction count; ClockCycleTime = 1/ClockRate.

The CPI figure gives insight into different styles of instruction sets and implementations.

Interdependence among instruction count, CPI, and clock rate:
• Clock rate: hardware technology and organization
• CPI: organization and instruction set architecture
• Instruction count: instruction set architecture and compiler technology

We cannot measure the performance of a computer by any single factor above alone.


Evaluating Instruction Set Design

Example (page 39): 1/4 of the ALU instructions, each together with its load, are replaced by a new r->m (register-memory) instruction. Assume that the clock cycle time is not changed. Is this a good idea?

Instruction    Before                       After
type           Frequency    Clock cycles    Frequency    Clock cycles
ALU ops        43%          1               36.1%        1
Loads          21%          2               11.4%        2
Stores         12%          2               13.5%        2
Branches       24%          2               26.9%        3
New r->m                                    12.1%        2

http://redcloud.uccs.edu/~cs520/doc/evalISD.xls


Evaluate Instruction Design

CPIold = (0.43*1 + 0.21*2 + 0.12*2 + 0.24*2) = 1.57

CPU timeold = InstructionCountold * 1.57 * ClockCycleTimeold

$$CPI_{new} = \frac{(0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + (0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3}{1 - 0.25 \times 0.43} = 1.908$$

CPU timenew = (0.893*InstructionCountold) * 1.908 * ClockCycleTimeold
= 1.703 * InstructionCountold * ClockCycleTimeold

With these assumptions, it is a bad idea to add register-memory instructions.
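The comparison can be reproduced in a few lines. This sketch expresses every count as a fraction of the old instruction count, so the clock cycle time and the absolute instruction count cancel out; the dictionary keys are illustrative names.

```python
# Instruction mix before the change: type -> (fraction of old IC, clock cycles).
old_mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

# Fraction of the old instruction count that becomes r->m instructions.
replaced = 0.25 * old_mix["alu"][0]

# Mix after the change, still relative to the OLD instruction count.
new_mix = {
    "alu": (old_mix["alu"][0] - replaced, 1),
    "load": (old_mix["load"][0] - replaced, 2),
    "reg_mem": (replaced, 2),
    "store": (old_mix["store"][0], 2),
    "branch": (old_mix["branch"][0], 3),  # branches now take 3 cycles
}

def total_cycles(mix):
    """Clock cycles per old instruction executed."""
    return sum(count * cycles for count, cycles in mix.values())

cpi_old = total_cycles(old_mix)                       # old fractions sum to 1
ic_ratio = sum(count for count, _ in new_mix.values())  # new IC / old IC
cpi_new = total_cycles(new_mix) / ic_ratio
time_ratio = total_cycles(new_mix) / total_cycles(old_mix)
print(f"CPI old {cpi_old:.2f}, CPI new {cpi_new:.3f}, "
      f"IC ratio {ic_ratio:.4f}, time new/old {time_ratio:.3f}")
```

The instruction count drops by about 11%, but the higher CPI more than offsets it, so the new design is slower under these assumptions.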


Estimate CPU time by Σ(CPIi*InstructionCounti)*ClockCycleTime

Program: f=(a-b)/(c-d*e)
MIPS R2000 25MHz

Instructions (op dst, src1, src2)

    lw   $14, 20($sp)
    lw   $15, 16($sp)
    subu $24, $14, $15
    lw   $25, 8($sp)
    lw   $8, 4($sp)
    mul  $9, $25, $8
    lw   $10, 12($sp)
    subu $11, $10, $9
    div  $12, $24, $11
    sw   $12, 0($sp)

IC = InstructionCount = 10
CPI = ClockcyclesPerInstruction
CPIi = ClockcyclesOfInstructionType i
Ii = number of instructions of type i in a program
ClockCycleTime = 1/ClockRate = 1/(25*10^6) = 40*10^-9 sec = 40 nsec

CPIi can be obtained from the processor handbook.

Here we assume no cache misses.


Estimate CPU time by ClockCycleTime*Σ(CPIi*InstructionCounti)

i    Instruction type    Count Ii    CPIi    CPIi*ICi
1    lw                  5           2       10
2    subu                2           1       2
3    mul                 1           1       1
4    div                 1           1       1
5    sw                  1           2       2
     Total                                   16

CPU Time = 16*40 nsec = 640 nsec
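The estimate can be scripted directly from the instruction sequence. A sketch using the per-type cycle counts from this slide (2 cycles for lw and sw, 1 for subu, mul, and div, with no cache misses); the variable names are illustrative.

```python
from collections import Counter

CLOCK_RATE_HZ = 25e6  # MIPS R2000 at 25 MHz

# Clock cycles per instruction type, as given on the slide.
CYCLES = {"lw": 2, "subu": 1, "mul": 1, "div": 1, "sw": 2}

# Opcode sequence compiled from f=(a-b)/(c-d*e).
program = ["lw", "lw", "subu", "lw", "lw", "mul", "lw", "subu", "div", "sw"]

counts = Counter(program)                                  # Ii per type
total_cycles = sum(CYCLES[op] * n for op, n in counts.items())
cpu_time_ns = total_cycles / CLOCK_RATE_HZ * 1e9

print(dict(counts), total_cycles, f"{cpu_time_ns:.0f} ns")
```

Swapping in cycle counts from a processor handbook, or adding a miss penalty to lw and sw, changes only the `CYCLES` table.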


Other Performance Measures

The only reliable measure of performance is the execution time of real programs. Other attempts:

1. MIPS (millions of instructions per second)

$$\mathrm{MIPS} = \frac{\mathrm{InstructionCount}}{\mathrm{ExecutionTime} \times 10^6} = \frac{\mathrm{ClockRate}}{CPI \times 10^6}$$

• Depends on instruction set, hard to compare
• MIPS varies with programs on the same computer

Example 1: the impact of using floating-point hardware on MIPS.
Example 2: the impact of optimizing compiler usage on MIPS.

What affects performance?
• input
• version of programs, compiler, OS, CPU
• optimization level of compiler
• machine configurations
– amount of cache, main memory, disks
– the speed of cache, main memory, disks, and bus


Myth of MIPS

Example: The effect of an optimizing compiler on the MIPS number. (Page 45)

A machine has a 500 MHz clock rate and the following clock cycles for instructions. For a program, the instruction counts before and after using an optimizing compiler are as shown in the table.

Instruction type    IC before optimization    CPIi    IC after optimization
ALU ops             86                        1       43
Loads               42                        2       42
Stores              24                        2       24
Branches            48                        2       48

CPI unoptimized = 86/200*1 + 42/200*2 + 24/200*2 + 48/200*2 = 1.57
MIPS unoptimized = (500*10^6)/(1.57*10^6) = 318.5
CPI optimized = 43/157*1 + 42/157*2 + 24/157*2 + 48/157*2 = 1.73
MIPS optimized = (500*10^6)/(1.73*10^6) = 289.0
CPU time unoptimized = 200*1.57*(2*10^-9) = 6.28*10^-7 sec
CPU time optimized = 157*1.73*(2*10^-9) = 5.43*10^-7 sec

The optimized program runs faster yet has the lower MIPS rating.
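The table can be recomputed to show MIPS and CPU time moving in opposite directions. A sketch with illustrative names; results differ slightly from the slide in the last digit because the slide rounds CPI to two decimals before dividing.

```python
CLOCK_RATE_HZ = 500e6
CPI_BY_TYPE = {"alu": 1, "load": 2, "store": 2, "branch": 2}

unoptimized = {"alu": 86, "load": 42, "store": 24, "branch": 48}
optimized = {"alu": 43, "load": 42, "store": 24, "branch": 48}

def summarize(counts):
    """Return (CPI, MIPS rating, CPU time in seconds) for an instruction mix."""
    instruction_count = sum(counts.values())
    cycles = sum(CPI_BY_TYPE[kind] * n for kind, n in counts.items())
    cpi = cycles / instruction_count
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time = cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time

for label, mix in (("unoptimized", unoptimized), ("optimized", optimized)):
    cpi, mips, cpu_time = summarize(mix)
    print(f"{label}: CPI {cpi:.2f}, MIPS {mips:.1f}, CPU time {cpu_time:.2e} s")
```

Removing cheap ALU instructions raises the average CPI, which lowers MIPS even though fewer total cycles are executed.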


MFLOPS

For scientific computing, MFLOPS is used as a metric:

$$\mathrm{MFLOPS} = \frac{\mathrm{NoOfFPOperationsInAProgram}}{\mathrm{ExecutionTime} \times 10^6}$$

Here it emphasizes operations instead of instructions.
• Unfortunately, the set of floating-point operations is not consistent across machines.
• The rating changes with different mix ratios of integer-floating or floating-floating instructions.

The solution is to use a canonical number of floating-point operations for certain types of FP operations, e.g. 1 for (add, sub, compare, mul), 4 for (fdiv, fsqrt), 8 for (arctan, sin, exp).


Programs to Evaluate Performance

Real programs: the set of programs to be run forms the workload.

Kernels: key pieces of real programs that isolate features of a machine, e.g., Livermore Loops (weighted ops), Linpack.

Toy benchmarks: 10 to 100 lines of code, e.g., quicksort, Sieve, Puzzle.

Synthetic benchmarks: artificially created to match an average execution profile, e.g., Whetstone, Dhrystone.

SPEC (System Performance Evaluation Cooperative) Benchmarks 89, 92, 95.

Perfect Club Benchmarks for parallel computations.


SPEC: System Performance Evaluation Cooperative Benchmark

• First Round 1989: 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
– Compiler flags unlimited. March 93 flags for the DEC 4000 Model 610:
    spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
– New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
– “Benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95, SPECfp_base95

http://www.spec.org/


Comparison of Machine Performance

Single program: execution time

Collection of (n) programs:
1. Total execution time
2. Normalized to a reference machine; compute the TimeRatio of the ith program:
   TimeRatio_i = Time_i/Time_i(ReferenceMachine)

$$\text{arithmetic mean} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{TimeRatio}_i$$

$$\text{geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \mathrm{TimeRatio}_i}$$

$$\text{harmonic mean} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\mathrm{TimeRatio}_i}}$$

The geometric mean is consistent regardless of the reference machine. The harmonic mean decreases the impact of outliers.


Summarize Performance Results

Example: Execution of two programs on three machines. Assume Program 1 has 10M floating point operations and Program 2 has 50M floating point operations.

                              Computer A           Computer B       Computer C
Program 1 (sec)               1                    10               20
Program 2 (sec)               100                  50               20
Total time (sec)              101                  60               40
Native MFLOPS on Program 1    10/1=10              10/10=1          10/20=0.5
Native MFLOPS on Program 2    50/100=0.5           50/50=1          50/20=2.5
Arithmetic mean               (10+0.5)/2=5.25      (1+1)/2=1        (0.5+2.5)/2=1.5
Geometric mean                sqrt(10*0.5)=2.24    sqrt(1*1)=1      sqrt(0.5*2.5)=1.12
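The three means can be computed for the MFLOPS ratings in this example. A sketch using only the standard library (`math.prod` requires Python 3.8 or later); the harmonic mean is included for comparison even though the slide's table does not list it.

```python
import math

# Native MFLOPS on Program 1 and Program 2 for each computer.
mflops = {"A": [10, 0.5], "B": [1, 1], "C": [0.5, 2.5]}

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)

for name, rates in mflops.items():
    print(f"Computer {name}: AM {arithmetic_mean(rates):.2f}, "
          f"GM {geometric_mean(rates):.2f}, HM {harmonic_mean(rates):.2f}")
```

The arithmetic mean of rates ranks Computer A first even though A has the longest total execution time, which is why the choice of mean matters.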


Weighted Arithmetic Means

• For a set of n programs, each taking Time_i on one machine, the “equal-time” weights on that machine are

$$w_i = \frac{1}{\mathrm{Time}_i \times \sum_{j=1}^{n} \frac{1}{\mathrm{Time}_j}}$$

Figure 1.12: W(3) [W(2)] are equal-time weights based on machine A [B]. This is used in Exercise 1.11.

            A        B        C     w(1)    w(2)     w(3)
P1 (sec)    1        10       20    0.5     0.909    0.999
P2 (sec)    1000     100      20    0.5     0.091    0.001
AM:W(1)     500.5    55       20
AM:W(2)     91.82    18.18    20
AM:W(3)     1.998    10.09    20

http://redcloud.uccs.edu/~cs520/doc/fig1_12.xls
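The weights and weighted means in Figure 1.12 follow from the equal-time formula. A sketch with illustrative function names:

```python
def equal_time_weights(times):
    """Weights that make every program contribute equal time on one machine."""
    inverse_sum = sum(1 / t for t in times)
    return [1 / (t * inverse_sum) for t in times]

def weighted_arithmetic_mean(weights, times):
    return sum(w * t for w, t in zip(weights, times))

# Execution times in seconds for programs P1 and P2.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

weightings = {
    "W(1)": [0.5, 0.5],                      # equal weights
    "W(2)": equal_time_weights(times["B"]),  # based on machine B
    "W(3)": equal_time_weights(times["A"]),  # based on machine A
}

for label, weights in weightings.items():
    means = [weighted_arithmetic_mean(weights, times[m]) for m in "ABC"]
    print(label, [round(w, 3) for w in weights], [round(m, 2) for m in means])
```

Each weight vector sums to 1, and the machine a weighting is based on always gets equal weighted time from each program.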


Hints for Homework # 1

Exercise 1.7:
1. Whetstone includes integer operations in addition to the floating-point operations.
2. When a floating-point processor is not used, all floating-point operations need to be emulated by integer operations (e.g. shift, and, add, sub, multiply, div...).
3. For different FP coprocessors, we will have the same # of integer ops but a different # of FP ops.

Exercise 1.11:
a. Use the equal-time weightings formula on page 26.
b. DEC3000 execution time(ora) = VAX-11/780 time(ora)/DEC3000 SPECRatio = 7421/165



FP Compilation Results depend on existence of FP coprocessor

Exercise 1.7. Whetstone is a benchmark with both Integer and Floating Point (FP) operations.


Compiling floating-point statement

Here are the generated assembly instructions of a floating-point operation statement in C on DEC3100 (with R2010 floating point unit) using the command cc -S.

    # 7 x=sqrt(exp(alog(x)/t1));
    s.d     $f4, 48($sp)    #load x to fp register f4
    l.d     $f12, 56($sp)   #load t1 to fp register f12
    jal     alog            #call subroutine alog
    move    $16, $2
    mtc1    $16, $f6
    cvt.d.w $f8, $f6        #f8 contains alog(x)
    l.d     $f10, 48($sp)
    div.d   $f12, $f8, $f10
    jal     exp
    mov.d   $f20, $f0
    mov.d   $f12, $f20
    jal     sqrt
    s.d     $f0, 56($sp)

Note that since the R2010 only implements simple floating-point add, sub, mult, and div operations, sqrt, exp, and alog are translated as subroutine calls using the jal instruction. The floating-point division is translated as div.d and will be executed by the R2010.


Homework #1

Problems 1.7 and 1.11

Problem A. Program segment: f=(a-b)/(a*b) is compiled into the following MIPS R2000 code.

Instructions (op dst, src1, src2)

lw $14, 20($sp) # a is allocated at M[sp+20]

lw $15, 16($sp) # b is allocated at M[sp+16]

subu $24, $14, $15

mul $9, $14, $15

div $12, $24, $9

sw $12, 0($sp) # f is allocated at M[sp+0]


Homework #1 (Continued)

Assume all the variables are already in the cache (i.e. there is no need to go to main memory for data) and Table 1 contains the clock cycles for each type of instruction when data is in the cache.

What is the execution time (in seconds) of the above segment using an R2000 chip with a 25 MHz clock?

Problem B. Assume the CPU operation accounts for 70% of the time in a system.

a) What is the overall speedup if we improve CPU speed by 100%?

b) How much faster should the new CPU be in order to have the overall speedup of 1.7?

c) Is it possible to have overall speedup of 3 by just improving the CPU?
