1/16/99 CS520S99 Introduction
C. Edward
Chow
Page 1
Why study computer architecture?
• To learn the principles for designing processors and systems
• To learn the system configuration trade-offs
  – what size of caches/memory is enough
  – what kind of buses to connect system components
  – what size (speed) of disks to use
• To choose a computer for a set of applications in a project
• To interpret the benchmark figures given by salespersons
• To decide which processor chips to use in a system
• To design the system software (compiler, OS) for a new processor?
• To be the leader of a processor design team?
• To learn several machines’ assembly languages?
The Basic Structure of a Computer
Control and Data Flow in Processor
A processor is made up of:
• Data operator (Arithmetic and Logic Unit, ALU), denoted D: consumes and combines information into a new meaning
• Control, denoted K: evokes operations of other components
Control is often distributed
Instruction Execution at Register Transfer Level (RTL)
• Consider the detailed execution of the instruction “move &100, %d0” (Moving constant 100 to register d0)
• Assume the instruction was loaded into memory location 1000
• The op code of the move instruction and the register address d0 are encoded in bytes 1000 and 1001
• The constant 100 is stored in bytes 1002 and 1003
RTL Instruction Execution
• Mpc is set to 1000, pointing at the instruction in memory
• Step 1: Mmar = Mpc; // put pc into mar; prepare to fetch instruction
[Figure: datapath diagram; value shown: 1000]
Update Program Counter
• Step 2: Mpc = Mpc+4; // update program counter
  move the Mpc value to D, D performs +4, move the result back to Mpc
[Figure: datapath diagram; values shown: 1000, 1000+2, 1002]
Instruction Fetch
• Step 3: Mir = Mp[Mmar]; // fetch instruction
  send the Mmar value to Mp; Mp retrieves move|d0 and sends it back to Mir
Steps 3 and 2 can be done in parallel.
[Figure: datapath diagram; values shown: 1000 and Move|d0, 100]
Instruction Decoding
• Step 4: Decode the instruction in Mir
[Figure: datapath diagram; value shown: Move|d0, 100]
RTL Instruction Execution
• Step 5: Mgeneral[0] = Mp[Mir_{16-31}]; // execute the move of the constant into a general register named d0
Subscript 16-31 denotes bits 16 through 31 of Mir, which contain the constant 100.
[Figure: datapath diagram; values shown: Move|d0, 100 and 100]
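The five register-transfer steps can be traced in a few lines of Python. This is only a sketch under stated assumptions: the opcode/register encoding `0x1234` is made up, memory is taken to be byte-addressed and big-endian, and Step 5 is read as copying the constant held in bits 16-31 of Mir directly into d0. None of these details are given on the slides.

```python
# Minimal register-transfer trace of "move &100, %d0".
# Assumptions (not from the slides): MOVE_D0 is a made-up encoding,
# memory is big-endian, and Step 5 copies bits 16-31 of Mir into d0.

MOVE_D0 = 0x1234  # hypothetical opcode|register halfword


def run_move():
    mp = {}  # Mp: primary memory, byte address -> byte value
    word = (MOVE_D0 << 16) | 100  # bytes 1000-1001: opcode, 1002-1003: constant
    for offset, byte in enumerate(word.to_bytes(4, "big")):
        mp[1000 + offset] = byte

    mpc = 1000            # program counter points at the instruction
    general = [0] * 8     # Mgeneral: general registers d0..d7

    mmar = mpc                                    # Step 1: Mmar = Mpc
    mpc = mpc + 4                                 # Step 2: Mpc = Mpc + 4
    mir = int.from_bytes(                         # Step 3: Mir = Mp[Mmar]
        bytes(mp[mmar + i] for i in range(4)), "big")
    opcode = mir >> 16                            # Step 4: decode upper halfword
    if opcode == MOVE_D0:
        general[0] = mir & 0xFFFF                 # Step 5: constant into d0
    return mpc, mir, general[0]


mpc, mir, d0 = run_move()
print(mpc, hex(mir), d0)
```

Steps 2 and 3 touch different registers, which is why the slides note that they can proceed in parallel; the sequential order here is only for readability.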
Computer Architecture
The term “computer architecture” was coined by IBM in 1964 for use with the IBM 360. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the instruction set. They believed that a family of machines of the same architecture should be able to run the same software.
Benefits:
• With a precisely defined architecture, we can have many compatible implementations.
• A program written in the same instruction set can run on all the compatible implementations.
Architecture & Implementation
• Single architecture, multiple implementations: computer family
• Multiple architectures, single implementation: microcode emulator
Computer Architecture Topics
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP
• Memory Hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols
• Input/Output and Storage: disks, WORM, tape; RAID; emerging technologies
• VLSI
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Computer Architecture Topics
[Figure: processor (P) and memory (M) nodes connected through switches (S) to an interconnection network]
• Multiprocessors: shared memory, message passing, data parallelism
• Networks and Interconnections: network interfaces; topologies, routing, bandwidth, latency, reliability
• Processor-Memory-Switch
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
CS 520 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century
[Figure: Computer Architecture (instruction set design, organization, hardware) at the center, surrounded by Technology, Programming Languages, Operating Systems, History, Applications, Interface Design (ISA), Measurement & Evaluation, and Parallelism]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Function Requirements faced by a computer designer
• Applications
  – General purpose: balanced performance for a range of tasks
  – Scientific: high-performance floating point
  – Commercial: support for COBOL (decimal arithmetic), database/transaction processing
• Level of software compatibility
  – Object code/binary level: no software porting, more hardware design cost
  – Programming language level: avoids the old architecture burden, requires software porting
Function Requirements faced by a computer designer
• Operating system requirements
  – Size of address space
  – Memory management/protection (e.g. garbage collection vs. real-time scheduling)
  – Interrupts/traps
• Standards
  – Floating point (IEEE 754)
  – I/O bus
  – OS
  – Networks
  – Programming languages
1988 Computer Food Chain
[Figure: 1988 food chain, in the order shown: PC, Workstation, Minicomputer, Mainframe, Mini-supercomputer, Supercomputer, Massively Parallel Processors]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
1998 Computer Food Chain
[Figure: 1998 food chain, with labels PC, Workstation, Server, Mainframe, Supercomputer, Mini-supercomputer, Massively Parallel Processors, Minicomputer]
Now who is eating whom?
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Why Such Change in 10 years?
• Performance
  – Technology advances
    • CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
  – Computer architecture advances improve the low end
    • RISC, superscalar, RAID, …
• Price: lower costs due to …
  – Simpler development
    • CMOS VLSI: smaller systems, fewer components
  – Higher volumes
    • CMOS VLSI: same development cost for 10,000 vs. 10,000,000 units
  – Lower margins by class of computer, due to fewer services
• Function
  – Rise of networking/local interconnection technology
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Technology Trends: Microprocessor Capacity
[Figure: Moore’s Law plot of transistors per chip (log scale, 1,000 to 100,000,000) against year (1970 to 2000), with points for i4004, i8080, i8086, i80286, i80386, i80486, and Pentium, and a “Graduation Window” marked]
CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs
Transistor counts:
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Memory Capacity (Single Chip DRAM)
[Figure: bits per DRAM chip (log scale, 1,000 to 1,000,000,000) against year (1970 to 2000)]
year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Technology Trends (Summary)
        Capacity         Speed (latency)
Logic   2x in 3 years    2x in 3 years
DRAM    4x in 3 years    2x in 10 years
Disk    4x in 3 years    2x in 10 years
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Processor Performance Trends
[Figure: performance (log scale, 0.1 to 1000) against year (1965 to 2000) for microprocessors, minicomputers, mainframes, and supercomputers]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Processor Performance (1.35X before, 1.55X now)
[Figure: performance (0 to 1200) against year (1987 to 1997) with a 1.54X/yr trend line; machines plotted: MIPS M2000, MIPS M/120, Sun-4/260, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Performance Trends (Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
• Improvement in cost performance estimated at 70% per year
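Both "50% per year" and "2X every 18 months" are rules of thumb, and a short calculation shows how close they are to each other. The sketch below assumes steady compound growth; the function names are illustrative only.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a compound annual growth rate (0.5 means 50%)."""
    return math.log(2) / math.log(1 + annual_growth)


def annual_growth_for_doubling(months):
    """Compound annual growth rate that doubles performance every `months` months."""
    return 2 ** (12 / months) - 1


# Months to double at exactly 50% per year.
print(round(doubling_time_years(0.5) * 12, 1))
# Yearly growth rate implied by doubling every 18 months.
print(round(annual_growth_for_doubling(18), 3))
```

At exactly 50% per year the doubling time comes out near 20.5 months, while doubling every 18 months corresponds to about 59% per year, so the two figures agree only roughly.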
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Computer Engineering Methodology
Technology Trends
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks
Technology Trends
Benchmarks
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks
Simulate New Designs and Organizations
Technology Trends
Benchmarks
Workloads
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks
Simulate New Designs and Organizations
Implement Next Generation System
Technology Trends
Benchmarks
Workloads
Implementation Complexity
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design and analysis loop in which creativity produces ideas and cost/performance analysis sorts them into good ideas, mediocre ideas, and bad ideas]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
Metric of Computer Architecture
• Space measured in bits of representation
• Time measured in bit traffic (memory bandwidth)
Many old frequency and benchmark studies focus on
• dynamic opcode (memory size concern)
• exponent differences of floating point operands (precision)
• length of decimal numbers in business files (memory size)
Trend: space is not much of a concern; speed/time is everything.
Here we focus more on the following two performance metrics:
• Response time = time between start and finish of an event
  – execution time
  – latency
• Throughput = total amount of work done in a given time
  – bandwidth (no. of bits or bytes moved per second)
Metrics of Performance at Different Levels
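[Figure: levels of a computer system, top to bottom: Application; Programming Language; Compiler; ISA; Datapath and Control; Function Units; Transistors, Wires, Pins]
Metrics, from the application level down to the hardware level:
• Answers per month, operations per second
• (Millions of) instructions per second: MIPS
• (Millions of) (FP) operations per second: MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)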
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Quantitative principles
Improve means
• increase performance
• decrease execution time
“X is n% faster than Y” means
$$\frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = 1 + \frac{n}{100}$$
Quantitative principles
• Make the common case fast
  – Amdahl’s Law
• Locality of reference
  – 90% of execution time in 10% of code
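Amdahl’s Law
Law of diminishing returns
$$\text{Speedup} = \frac{\text{ExecutionTimeWithoutEnhancement}}{\text{ExecutionTimeWithEnhancement}}$$
$$\text{Time}_{new} = \text{Time}_{old} \times \left[ (1 - \text{FractionInEnhancedMode}) + \frac{\text{FractionInEnhancedMode}}{\text{SpeedupOfEnhancedMode}} \right]$$
$$\text{Speedup} = \frac{\text{Time}_{old}}{\text{Time}_{new}} = \frac{1}{(1 - \text{FractionInEnhancedMode}) + \frac{\text{FractionInEnhancedMode}}{\text{SpeedupOfEnhancedMode}}}$$
Example: FractionInEnhancedMode = 0.5 (based on the old system), SpeedupOfEnhancedMode = 2
Time_old = 50 + 50; Time_new = 50 + 25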
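Amdahl's Law is easy to check numerically. The following is a minimal sketch; the function and parameter names are illustrative, and the old execution time of 100 units matches the 50 + 50 example on the slide.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the OLD execution time
    runs `speedup_enhanced` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def new_time(old_time, fraction_enhanced, speedup_enhanced):
    """Execution time after the enhancement."""
    return old_time * ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Slide example: half of a 100-unit run is enhanced by a factor of 2.
print(new_time(100, 0.5, 2))                 # 50 unchanged + 25 enhanced
print(round(amdahl_speedup(0.5, 2), 2))      # overall speedup
```

Note that the fraction is always measured on the old system, before the enhancement is applied.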
Amdahl’s Law Result
FractionInEnhancedMode   OverallSpeedup when SpeedupOfEnhancedMode=2   OverallSpeedup when SpeedupOfEnhancedMode=∞
0.1                      1.05                                          1.1
0.3                      1.18                                          1.4
0.5                      1.33                                          2
0.7                      1.5                                           3.33
0.9                      1.82                                          10
0.99                     2                                             100
Apply Amdahl’s Law: Example 1
Example 1: Assume that memory access accounts for 90% of the execution time. What is the speedup from replacing a 100 ns memory with a 10 ns memory? How much faster is the new system?
Answer:
FractionInEnhancedMode = 90% = 0.9
SpeedupOfEnhancedMode = 100 ns / 10 ns = 10
$$\text{OverallSpeedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.1 + 0.09} = \frac{1}{0.19} = 5.26 = 1 + \frac{426}{100}$$
The new system is 426% faster than the old one.
Is it worthwhile if the high-speed memory costs 10 times more?
Apply Amdahl’s Law: Example 2
Example 2: Assume that 40% of the time is spent on CPU tasks; the rest is spent on I/O. Assume we improve the CPU and keep the I/O speed unchanged.
a) How much faster should the new CPU be to have an overall speedup of 1.5?
b) Is it possible to have an overall speedup of 2? Why?
Solution:
a) Solve
$$1.5 = \frac{1}{(1 - 0.4) + \frac{0.4}{x}}$$
for x: x = 6, so the new CPU must be 500% faster.
b) The maximum overall speedup that can be achieved is
$$\frac{1}{1 - 0.4} \approx 1.67$$
Therefore, it is not possible to achieve an overall speedup of 2.
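Example 2 runs Amdahl's Law backwards: given a target overall speedup, find the speedup needed in the enhanced part. A small sketch of that inversion follows; the function name is hypothetical, and it returns `None` when the target exceeds the limit 1/(1 - fraction).

```python
def required_enhanced_speedup(fraction_enhanced, target_overall):
    """Speedup needed in the enhanced part to reach `target_overall`,
    or None if no finite speedup can reach it."""
    # Limit of the overall speedup as the enhanced part becomes infinitely fast.
    limit = 1.0 / (1.0 - fraction_enhanced)
    if target_overall >= limit:
        return None
    # Solve target = 1 / ((1 - f) + f / x) for x.
    return fraction_enhanced / (1.0 / target_overall - (1.0 - fraction_enhanced))


print(required_enhanced_speedup(0.4, 1.5))   # part (a): about 6
print(required_enhanced_speedup(0.4, 2.0))   # part (b): None, target is unreachable
```

The same function can be used to explore how quickly the required speedup grows as the target approaches the limit.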
Apply Amdahl’s Law: Example 3
Example: A recent study of the bottleneck of a 10 Mbps Ethernet network system showed that only 10% of the execution time of a distributed application was spent on transmitting messages and 90% of the time was spent on application/protocol software execution at the host computers. If we replace Ethernet with 100 Mbps FDDI, which is 900% faster than Ethernet, what will be the speedup of this improvement? What if we use 900% faster hosts?
Execution Time
The first performance metric and the best metric.
Measure the time it takes to execute the intended application(s) or the typical workload. The time command can measure an application:
    vlsia[93]: time ts9
    217.1u 27.2s 8:16 49% 0+27552k 6+3io 26pf+0w
Here is an example which shows how OS and I/O impact the execution time.
For program 1,
• Elapsed time = sum(t1:t11) - t6 - t8
• System CPU time = t1 + t3 + t5 + t9 + t11
• CPU time = t1 + t3 + t4 + t5 + t9 + t10
• User CPU time = t4 + t10
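The same split between elapsed, user CPU, and system CPU time can be observed from inside a program. The sketch below uses Python's standard `os.times()` and `time.perf_counter()`; the `measure` helper and the `busy_loop` workload are made up for illustration.

```python
import os
import time


def measure(work):
    """Run `work()` and return (elapsed, user_cpu, system_cpu) in seconds."""
    start_wall = time.perf_counter()
    start = os.times()
    work()
    end = os.times()
    elapsed = time.perf_counter() - start_wall
    return elapsed, end.user - start.user, end.system - start.system


def busy_loop():
    """CPU-bound toy workload (hypothetical)."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


elapsed, user, system = measure(busy_loop)
print(f"{user:.2f}u {system:.2f}s elapsed={elapsed:.2f}")
```

For a CPU-bound workload on an idle machine the user time should account for most of the elapsed time; adding I/O or running alongside other processes widens the gap.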
CPU Time
$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{ClockCycleTime}, \qquad \text{CPI} = \frac{\sum_i \text{CPI}_i \times I_i}{\text{IC}}$$
CPI = clock cycles per instruction; I_i is the frequency of instruction i in a program; IC = instruction count; ClockCycleTime = 1/ClockRate.
The CPI figure gives insight into different styles of instruction sets and implementations.
Interdependence among instruction count, CPI, and clock rate:
• Clock rate: hardware technology and organization
• CPI: organization and instruction set architecture
• Instruction count: instruction set architecture and compiler technology
We cannot measure the performance of a computer by any single factor above alone.
Evaluating Instruction Set Design
Example (page 39): 1/4 of the ALU operations, together with their loads, are replaced by a new register-memory (r->m) instruction. Assume that the clock cycle time is not changed. Is this a good idea?
Instruction   Frequency before   Clock cycles before   Frequency after   Clock cycles after
ALU ops       43%                1                     36.1%             1
Loads         21%                2                     11.4%             2
Stores        12%                2                     13.5%             2
Branches      24%                2                     26.9%             3
New r->m      (none)             (none)                12.1%             2
Evaluate Instruction Design
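CPI_old = (0.43*1 + 0.21*2 + 0.12*2 + 0.24*2) = 1.57
CPU time_old = InstructionCount_old * 1.57 * ClockCycleTime_old
$$\text{CPI}_{new} = \frac{(0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + (0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3}{1 - 0.25 \times 0.43} = 1.908$$
CPU time_new = (0.893 * InstructionCount_old) * 1.908 * ClockCycleTime_old
             = 1.703 * InstructionCount_old * ClockCycleTime_old
With these assumptions, CPU time rises from 1.57 to 1.703 (in units of InstructionCount_old * ClockCycleTime_old), so it is a bad idea to add register-memory instructions.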
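The comparison can be reproduced with a few lines of Python. This is a sketch using the frequencies and cycle counts from the example; the `cpi` helper and the variable names are illustrative.

```python
def cpi(mix):
    """Average cycles per instruction for a list of (frequency, cycles) pairs.
    Frequencies need not sum to 1; they are normalized by their total."""
    total = sum(freq for freq, _ in mix)
    return sum(freq * cycles for freq, cycles in mix) / total


# Before: ALU ops, loads, stores, branches.
old_mix = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]

# After: a quarter of the ALU ops (and the loads feeding them) become r->m,
# and branches take 3 cycles instead of 2.
moved = 0.25 * 0.43
new_mix = [(0.43 - moved, 1), (0.21 - moved, 2), (moved, 2), (0.12, 2), (0.24, 3)]

# CPU time in cycles per ORIGINAL instruction = relative instruction count * CPI.
old_time = sum(f for f, _ in old_mix) * cpi(old_mix)
new_time = sum(f for f, _ in new_mix) * cpi(new_mix)

print(round(cpi(old_mix), 3), round(cpi(new_mix), 3))
print(new_time > old_time)   # True: the change makes the program slower
```

Fewer instructions are executed after the change, but the higher CPI more than cancels that gain.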
Estimate CPU time by Σ(CPI_i * InstructionCount_i) * ClockCycleTime
Program: f=(a-b)/(c-d*e)
MIPS R2000, 25 MHz
Instructions (op dst, src1, src2):
    lw   $14, 20($sp)
    lw   $15, 16($sp)
    subu $24, $14, $15
    lw   $25, 8($sp)
    lw   $8, 4($sp)
    mul  $9, $25, $8
    lw   $10, 12($sp)
    subu $11, $10, $9
    div  $12, $24, $11
    sw   $12, 0($sp)
IC = InstructionCount = 10
CPI = ClockCyclesPerInstruction
CPI_i = ClockCyclesOfInstructionType i
I_i = number of instructions of type i in a program
ClockCycleTime = 1/ClockRate = 1/(25*10^6) = 40*10^-9 sec = 40 nsec
CPI_i can be obtained from the processor handbook.
Here we assume no cache misses.
Estimate CPU time by ClockCycleTime * Σ(CPI_i * InstructionCount_i)
i       Instruction type   I_i (count)   CPI_i   CPI_i * I_i
1       lw                 5             2       10
2       subu               2             1       2
3       mul                1             1       1
4       div                1             1       1
5       sw                 1             2       2
Total                                            16
CPU Time = 16 * 40 nsec = 640 nsec
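The estimate is a weighted sum followed by one division. The sketch below uses the instruction counts and per-type cycle counts from the table and the 25 MHz clock rate; the data layout and names are illustrative.

```python
CLOCK_RATE_HZ = 25e6  # MIPS R2000 at 25 MHz

# (instruction type, count I_i, cycles CPI_i), no cache misses assumed
mix = [("lw", 5, 2), ("subu", 2, 1), ("mul", 1, 1), ("div", 1, 1), ("sw", 1, 2)]


def total_cycles(mix):
    """Sum of CPI_i * I_i over all instruction types."""
    return sum(count * cycles for _, count, cycles in mix)


def cpu_time_seconds(mix, clock_rate_hz):
    """CPU time = total cycles * clock cycle time = total cycles / clock rate."""
    return total_cycles(mix) / clock_rate_hz


print(total_cycles(mix))                                        # 16
print(round(cpu_time_seconds(mix, CLOCK_RATE_HZ) * 1e9), "ns")  # 640 ns
```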
Other Performance Measures
The only reliable measure of performance is the execution time of real programs. Other attempts:
1. MIPS
$$\text{MIPS} = \frac{\text{InstructionCount}}{\text{ExecutionTime} \times 10^6} = \frac{\text{ClockRate}}{\text{CPI} \times 10^6}$$
• Depends on instruction set, hard to compare
• MIPS varies with programs on the same computer
Example 1: the impact of using floating-point hardware on MIPS.
Example 2: the impact of optimizing compiler usage on MIPS.
What affects performance?
• input
• version of programs, compiler, OS, CPU
• optimizing level of compiler
• machine configurations
  – amount of cache, main memory, disks
  – the speed of cache, main memory, disks, and bus
Myth of MIPS
Example: The effect of an optimizing compiler on the MIPS number. (Page 45)
A machine has a 500 MHz clock rate and the following clock cycles for instructions. For a program, the instruction counts before and after using an optimizing compiler are as shown in the table.
Instruction type   IC before optimization   CPI_i   IC after optimization
ALU ops            86                       1       43
Loads              42                       2       42
Stores             24                       2       24
Branches           48                       2       48
CPI unoptimized = 86/200*1 + 42/200*2 + 24/200*2 + 48/200*2 = 1.57
MIPS unoptimized = 500*10^6/(1.57*10^6) = 318.5
CPI optimized = 43/157*1 + 42/157*2 + 24/157*2 + 48/157*2 = 1.73
MIPS optimized = 500*10^6/(1.73*10^6) = 289.0
CPU time unoptimized = 200*1.57*(2*10^-9) = 6.28*10^-7 sec
CPU time optimized = 157*1.73*(2*10^-9) = 5.43*10^-7 sec
The optimized program runs faster even though its MIPS rating is lower.
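The MIPS rating and the CPU time can be computed side by side to show them moving in opposite directions. This sketch uses the counts and cycle values from the example; the `summarize` helper is illustrative.

```python
CLOCK_RATE_HZ = 500e6
CYCLES = {"ALU ops": 1, "Loads": 2, "Stores": 2, "Branches": 2}
before = {"ALU ops": 86, "Loads": 42, "Stores": 24, "Branches": 48}
after = {"ALU ops": 43, "Loads": 42, "Stores": 24, "Branches": 48}


def summarize(counts):
    """Return (CPI, MIPS rating, CPU time in seconds) for an instruction mix."""
    ic = sum(counts.values())
    cycles = sum(n * CYCLES[kind] for kind, n in counts.items())
    cpi = cycles / ic
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time = cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time


for label, counts in (("unoptimized", before), ("optimized", after)):
    cpi, mips, cpu_time = summarize(counts)
    print(f"{label}: CPI={cpi:.2f} MIPS={mips:.1f} time={cpu_time:.3g} s")
```

The optimized figures can differ from the slide values in the last digit because the slide rounds CPI to 1.73 before dividing.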
MFLOPS
For scientific computing, MFLOPS is used as a metric:
$$\text{MFLOPS} = \frac{\text{NoOfFPOperationInAProgram}}{\text{ExecutionTime} \times 10^6}$$
Here it emphasizes operations instead of instructions.
• Unfortunately, the set of floating-point operations is not consistent across machines.
• The rating changes with different mix ratios of integer-floating or floating-floating instructions.
The solution is to use a canonical number of floating-point operations for certain types of FP operations, e.g. 1 for (add, sub, compare, mul), 4 for (fdiv, fsqrt), 8 for (arctan, sin, exp).
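A normalized MFLOPS rating weights each source-level operation by its canonical cost before dividing by the run time. The sketch below uses the weights quoted on the slide; the operation counts and the 2-second run time are made-up values for illustration.

```python
# Canonical weights from the slide: 1 for add/sub/compare/mul,
# 4 for fdiv/fsqrt, 8 for arctan/sin/exp.
WEIGHTS = {
    "add": 1, "sub": 1, "compare": 1, "mul": 1,
    "fdiv": 4, "fsqrt": 4,
    "arctan": 8, "sin": 8, "exp": 8,
}


def normalized_mflops(op_counts, execution_time_s):
    """Weighted FP operation count divided by (execution time * 10^6)."""
    normalized_ops = sum(WEIGHTS[op] * n for op, n in op_counts.items())
    return normalized_ops / (execution_time_s * 1e6)


# Hypothetical program profile (not from the slides).
counts = {"add": 4_000_000, "mul": 3_000_000, "fdiv": 500_000, "exp": 250_000}
print(normalized_mflops(counts, 2.0))   # (4M + 3M + 2M + 2M) / 2M = 5.5
```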
Programs to Evaluate Performance
Real programs: the set of programs to be run forms the workload.
Kernels: key pieces of real programs that isolate features of a machine, e.g., Livermore Loops (weighted ops), Linpack
Toy benchmarks: 10 to 100 lines of code, e.g., quicksort, Sieve, Puzzle
Synthetic benchmarks: artificially created to match an average execution profile, e.g., Whetstone, Dhrystone
SPEC (System Performance Evaluation Cooperative) benchmarks: 89, 92, 95.
Perfect Club benchmarks for parallel computations.
SPEC: System Performance Evaluation Cooperative Benchmark
• First Round 1989: 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  – Compiler flags unlimited. March 93 of DEC 4000 Model 610:
      spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
      wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
      nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
  – New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  – “Benchmarks useful for 3 years”
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
Comparison of Machine Performance
Single program: execution time
Collection of (n) programs:
1. Total execution time
2. Normalized to a reference machine: compute the TimeRatio of the i-th program,
   TimeRatio_i = Time_i / Time_i(ReferenceMachine)
$$\text{arithmetic mean} = \frac{1}{n} \sum_{i=1}^{n} \text{TimeRatio}_i$$
$$\text{geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{TimeRatio}_i}$$
$$\text{harmonic mean} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\text{TimeRatio}_i}}$$
The geometric mean is consistent regardless of the reference machine.
The harmonic mean decreases the impact of outliers.
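The three means can give noticeably different summaries of the same ratios. The sketch below implements the standard definitions; the two sample ratios are hypothetical.

```python
import math


def arithmetic_mean(ratios):
    return sum(ratios) / len(ratios)


def geometric_mean(ratios):
    return math.prod(ratios) ** (1.0 / len(ratios))


def harmonic_mean(ratios):
    return len(ratios) / sum(1.0 / r for r in ratios)


# Hypothetical time ratios: one program takes half the reference time,
# the other takes twice the reference time.
ratios = [0.5, 2.0]
print(arithmetic_mean(ratios))   # 1.25
print(geometric_mean(ratios))    # 1.0
print(harmonic_mean(ratios))     # 0.8
```

Only the geometric mean reports the two machines as equivalent here, which illustrates its independence from the choice of reference machine.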
Summarize Performance Results
Example: Execution of two programs on three machines. Assume Program 1 has 10M floating-point operations and Program 2 has 50M floating-point operations.
                             Computer A          Computer B    Computer C
Program 1 (sec)              1                   10            20
Program 2 (sec)              100                 50            20
Total time (sec)             101                 60            40
Native MFLOPS on Program 1   10/1=10             10/10=1       10/20=0.5
Native MFLOPS on Program 2   50/100=0.5          50/50=1       50/20=2.5
Arithmetic mean              (10+0.5)/2=5.25     (1+1)/2=1     (0.5+2.5)/2=1.5
Geometric mean               sqrt(10*0.5)=2.24   sqrt(1*1)=1   sqrt(0.5*2.5)=1.12
Weighted Arithmetic Means
• For a set of n programs, each taking Time_i on one machine, the “equal-time” weights on that machine are
$$w_i = \frac{1}{\text{Time}_i \times \sum_{j=1}^{n} \frac{1}{\text{Time}_j}}$$
Figure 1.12: W(3) [W(2)] are equal-time weights based on machine A [B]. This is used in Exercise 1.11.
           A       B       C    w(1)   w(2)    w(3)
P1 (sec)   1       10      20   0.5    0.909   0.999
P2 (sec)   1000    100     20   0.5    0.091   0.001
AM:W(1)    500.5   55      20
AM:W(2)    91.82   18.18   20
AM:W(3)    1.998   10.09   20
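Equal-time weights are chosen so that every program contributes the same weighted time on the machine they are derived from. The sketch below recomputes the W(2) row from the P1 and P2 times in the table; the function names are illustrative.

```python
def equal_time_weights(times):
    """w_i = 1 / (Time_i * sum_j(1 / Time_j)); the weights sum to 1."""
    inv_sum = sum(1.0 / t for t in times)
    return [1.0 / (t * inv_sum) for t in times]


def weighted_mean(weights, times):
    return sum(w * t for w, t in zip(weights, times))


# P1 and P2 execution times (seconds) on machines A, B, and C.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

w2 = equal_time_weights(times["B"])          # weights based on machine B
print([round(w, 3) for w in w2])             # [0.909, 0.091]
for machine, t in times.items():
    print(machine, round(weighted_mean(w2, t), 2))
```

On machine B itself each program contributes the same weighted time (about 9.09 seconds), which is the defining property of the weighting.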
Hints for Homework # 1
Exercise 1.7:
1. Whetstone includes integer operations besides the floating-point operations.
2. When a floating-point processor is not used, all floating-point operations need to be emulated by integer operations (e.g. shift, and, add, sub, multiply, div...).
3. For different FP coprocessors, we will have the same # of integer ops but a different # of FP ops.
Exercise 1.11:
a. Use the equal-time weightings formula on Page 26.
b. DEC3000 execution time(ora) = VAX11 780Time(ora) / DEC3000SPECRatio = 7421/165
FP Compilation Results depend on existence of FP coprocessor
Exercise 1.7. Whetstone is a benchmark with both Integer and Floating Point (FP) operations.
Compiling floating-point statement
Here are the generated assembly instructions of a floating-point operation statement in C on DEC3100 (with R2010 floating point unit) using the command cc -S:
    # 7 x=sqrt(exp(alog(x)/t1));
    s.d $f4, 48($sp)        #load x to fp register f4
    l.d $f12, 56($sp)       #load t1 to fp register f12
    jal alog                #call subroutine alog
    move $16, $2
    mtc1 $16, $f6
    cvt.d.w $f8, $f6        #f8 contains alog(x)
    l.d $f10, 48($sp)
    div.d $f12, $f8, $f10
    jal exp
    mov.d $f20, $f0
    mov.d $f12, $f20
    jal sqrt
    s.d $f0, 56($sp)
Note that since the R2010 only implements simple floating-point add, sub, mult, and div operations, sqrt, exp, and alog are translated as subroutine calls using the jal instruction. The floating-point division is translated as div.d and will be executed by the R2010.
Homework #1
Problems 1.7 and 1.11
Problem A. The program segment f=(a-b)/(a*b) is compiled into the following MIPS R2000 code.
Instructions (op dst, src1, src2)
lw $14, 20($sp) # a is allocated at M[sp+20]
lw $15, 16($sp) # b is allocated at M[sp+16]
subu $24, $14, $15
mul $9, $14, $15
div $12, $24, $9
sw $12, 0($sp) # f is allocated at M[sp+0]
Homework #1 (Continued)
Assume all the variables are already in the cache (i.e. the processor does not have to go to main memory for data) and Table 1 contains the clock cycles for each type of instruction when data is in the cache.
What is the execution time (in seconds) of the above segment using an R2000 chip with a 25 MHz clock?
Problem B. Assume the CPU operation accounts for 70% of the time in a system.
a) What is the overall speedup if we improve CPU speed by 100%?
b) How much faster should the new CPU be in order to have the overall speedup of 1.7?
c) Is it possible to have overall speedup of 3 by just improving the CPU?