Performance Tuning by Dijesh P

Faculty Development Programme

on

Performance Tuning

Dijesh P

27 July 2012

Quantitative Principles of Computer Design

• The most important principle of computer design is to make the common case fast.

• That is, favor the frequent case over the infrequent case.

• Improving the frequent event, rather than the infrequent event, will help improve performance.

• Amdahl’s law can be used to quantify this principle.

Amdahl’s Law

• States that “the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used”.

• Amdahl’s law defines “Speedup” that can be gained by using a particular feature.

• We can make an enhancement to a machine that will improve the performance when it is used.

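• Speedup is the ratio of performance with and without the enhancement:

$$\text{Speedup} = \frac{\text{Performance for entire task using enhancement when possible}}{\text{Performance for entire task without using enhancement}}$$

• Alternatively, in terms of execution time:

$$\text{Speedup} = \frac{\text{Execution time for entire task without using enhancement}}{\text{Execution time for entire task using enhancement when possible}}$$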

• Speedup tells us how much faster a task will run on the machine with the enhancement, as opposed to the original machine.

• Amdahl’s law gives a quick way to find the speedup from some enhancement, which depends on two factors.

• The fraction of computation time in the original machine that can be converted to take advantage of the enhancement. (Fraction will always be less than or equal to 1).

• The improvement gained by the enhanced mode of execution; that is how much faster the task would run if the enhanced mode were used for the entire program.

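• The execution time with the enhancement is:

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$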

• The overall speedup is the ratio of the execution times.

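$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$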
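As a rough illustration of Amdahl's law, the overall speedup can be computed directly from the two factors. This is a minimal sketch: the function name `amdahl_speedup` and the example figures (40% of the time enhanced, a 10 times faster mode) are assumptions chosen for illustration, not values from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the execution time is enhanced."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New time relative to old time: unenhanced part plus shortened enhanced part.
    relative_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / relative_time


# Hypothetical case: 40% of the time can use a mode that is 10 times faster.
overall = amdahl_speedup(0.4, 10.0)
print(f"Overall speedup: {overall:.2f}")
```

Even with a 10 times faster mode, the overall gain stays well below 2 here, because 60% of the original time is unaffected.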

The CPU performance Equation

• Essentially all computers are constructed using a clock running at a constant rate.

• These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles.

• Designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz).

• CPU time for a program can be expressed in two ways:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

or

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

• We can also count the number of instructions executed: the instruction count (IC).

• If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI):

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$

• CPU time can then be written as:

$$\text{CPU time} = \text{Instruction count} \times \text{Clock cycle time} \times \text{Cycles per instruction}$$

• CPU time depends on three characteristics:
– Clock cycle time (clock rate): hardware technology and organization.
– Clock cycles per instruction (CPI): organization and ISA.
– Instruction count: ISA and compiler technology.
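The CPU performance equation can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the sample program (1,000,000 instructions, 1,500,000 clock cycles, a 1 GHz clock) are assumptions, not measurements from the slides.

```python
def average_cpi(clock_cycles: int, instruction_count: int) -> float:
    """Average clock cycles per instruction."""
    return clock_cycles / instruction_count


def cpu_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds: IC x CPI x clock cycle time (1 / clock rate)."""
    clock_cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * clock_cycle_time


# Hypothetical program: 1,000,000 instructions taking 1,500,000 cycles at 1 GHz.
cpi = average_cpi(1_500_000, 1_000_000)
seconds = cpu_time(1_000_000, cpi, 1e9)
print(f"CPI = {cpi}, CPU time = {seconds * 1e3:.2f} ms")
```

Halving any one of the three factors (instruction count, CPI, or clock cycle time) halves the CPU time, which is why all three are treated as equally important tuning targets.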

Principle of Locality

• Programs tend to reuse data and instructions they have used recently.
– Temporal locality: recently accessed items are likely to be accessed in the near future.
– Spatial locality: items whose addresses are near one another tend to be referenced close together in time.

• We can predict what instructions and data a program will use in the near future based on its accesses in the recent past.
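One way to see why locality matters is to replay address streams through a toy cache model. The sketch below assumes a small direct-mapped cache; the function `hit_rate`, the cache size (64 lines of 64 bytes), and the three address streams are all hypothetical choices for illustration.

```python
def hit_rate(addresses, num_lines: int = 64, block_size: int = 64) -> float:
    """Hit rate of a toy direct-mapped cache for a stream of byte addresses."""
    cache = [None] * num_lines      # block number currently held by each line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // block_size  # which memory block the byte belongs to
        line = block % num_lines    # direct-mapped placement
        if cache[line] == block:
            hits += 1
        else:
            cache[line] = block     # miss: bring the block into the cache
        total += 1
    return hits / total if total else 0.0


sequential = range(0, 4096)            # neighbouring bytes: spatial locality
strided = range(0, 4096 * 64, 64)      # one byte per block: no reuse at all
repeated = list(range(0, 256)) * 16    # same 256 bytes reused: temporal locality

for name, stream in [("sequential", sequential), ("strided", strided), ("repeated", repeated)]:
    print(f"{name}: {hit_rate(stream):.3f}")
```

The sequential and repeated streams hit almost every time, while the strided stream never reuses a block and so never hits in this model.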

Instruction Set Architecture

• The most basic difference among instruction set architectures is the type of internal storage in the processor.

• The major choices are: Stack, Accumulator (AC) or a set of registers.

• The operands in a stack architecture are implicitly on the top of the stack.

• In an AC architecture, one operand is implicitly the AC.

• General-purpose register (GPR) architectures have only explicit operands: either registers or memory locations.

• Consider the instruction C = A + B.

Stack       Accumulator   Register            Register
                          (register-memory)   (load-store)
PUSH A      LOAD A        LOAD R1,A           LOAD R1,A
PUSH B      ADD B         ADD R3,R1,B         LOAD R2,B
ADD         STORE C       STORE R3,C          ADD R3,R1,R2
POP C                                         STORE R3,C
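To make the difference between implicit and explicit operands concrete, the stack and load-store sequences for C = A + B can be run on two tiny interpreters. This is a sketch under stated assumptions: the tuple-based instruction format and the functions `run_stack` and `run_load_store` are invented for illustration and do not model any real machine.

```python
def run_stack(program, memory):
    """Stack machine: ADD takes both operands implicitly from the top of the stack."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(memory[args[0]])
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "POP":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack instruction: {op}")
    return memory


def run_load_store(program, memory):
    """Load-store machine: only LOAD and STORE touch memory; ADD names registers."""
    regs = {}
    for op, *args in program:
        if op == "LOAD":
            regs[args[0]] = memory[args[1]]
        elif op == "ADD":
            dest, src1, src2 = args
            regs[dest] = regs[src1] + regs[src2]
        elif op == "STORE":
            memory[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown load-store instruction: {op}")
    return memory


stack_program = [("PUSH", "A"), ("PUSH", "B"), ("ADD",), ("POP", "C")]
load_store_program = [
    ("LOAD", "R1", "A"),
    ("LOAD", "R2", "B"),
    ("ADD", "R3", "R1", "R2"),
    ("STORE", "R3", "C"),
]

print(run_stack(stack_program, {"A": 2, "B": 3})["C"])
print(run_load_store(load_store_program, {"A": 2, "B": 3})["C"])
```

Both programs have four instructions, but the stack ADD carries no operand fields at all, while the load-store ADD must name three registers.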

• There are two classes of register computers: register-memory architectures and register-register architectures.

• A register-memory architecture can access memory as part of any instruction, whereas a register-register (load-store) architecture can access memory only with load and store instructions.

• A third class, no longer used nowadays, is the memory-memory architecture, which keeps all operands in memory.

• Some ISAs have more registers than a single accumulator but place restrictions on the uses of these special-purpose registers.

• Such an architecture is called an extended accumulator or special-purpose register computer.

• Almost all computers nowadays are based on load-store register architectures. There are two reasons for this: registers are very fast, and compilers can use registers efficiently.

• Registers can be used to hold variables.

• When variables are allocated to registers, the memory traffic reduces, and the program speeds up.

• Two major concerns in the ISA are:
– Whether an ALU instruction has two or three operands.
– How many of the operands may be memory addresses in an ALU instruction.

• This divides the GPR architecture into different sub-categories.

No. of memory   Max. no. of   Architecture type                Examples
addresses       operands
0               3             Register-register (load-store)   MIPS, ARM, PowerPC
1               2             Register-memory                  IBM 360/370, Intel 80x86
2               2             Memory-memory                    VAX
3               3             Memory-memory                    VAX

• VAX is a 32-bit computing architecture that supports virtual addressing. It was developed in the mid-1970s by Digital Equipment Corporation (DEC). DEC was later purchased by Compaq, which in turn was purchased by Hewlett-Packard.

Register-Register

• Advantages:
– Simple, fixed-length instructions.
– Simple code generation model.
– Instructions take similar numbers of clock cycles.

• Disadvantages:
– Higher instruction count than architectures with memory references in instructions.
– More instructions lead to larger programs.

Register-Memory

• Advantages:
– Data can be accessed without a separate load instruction first.
– Instruction format is easy to encode.

• Disadvantages:
– Operands are not equivalent, since a source operand in a binary operation is destroyed.
– Encoding a register number and a memory address in each instruction may restrict the number of registers.
– Clocks per instruction vary.

Memory-Memory (2,2) or (3,3)

• Advantages:
– Most compact.
– Doesn’t waste registers for temporary data.

• Disadvantages:
– Large variation in instruction size, especially for three-operand instructions.
– Large variation in work per instruction.
– Memory accesses create a memory bottleneck.

Memory Addressing

• An architecture must specify how memory addresses are interpreted, irrespective of whether it is register-register (load-store) or allows memory operands in any instruction.

• The measurements presented here are largely computer independent.

• In some cases the measurements are affected by the compiler technology.

Interpreting Memory Addresses

• All instruction sets are assumed to be byte addressed and to provide access to bytes (8 bits), half words (16 bits), and words (32 bits); most computers also provide access to double words (64 bits).

• There are two conventions for ordering the bytes within a larger object: little endian and big endian.

• Little endian byte order puts the byte whose address is “x...x000” at the least-significant position in the double word. The bytes are numbered:

7 6 5 4 3 2 1 0

• Big endian byte order puts the byte whose address is “x...x000” at the most-significant position in the double word. The bytes are numbered:

0 1 2 3 4 5 6 7

• Little endian ordering fails to match the normal ordering of words when strings are compared.

• Strings appear backwards in the registers, for example as “SDRAWKCAB”.
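The two byte orders can be inspected with Python's built-in integer conversions. The sketch below uses an arbitrary 64-bit value and a hypothetical helper, `register_view`, which loads bytes from memory as one integer and then lists them from the most-significant byte down, roughly how a register dump would show them.

```python
value = 0x0102030405060708

# Memory image of the same 64-bit value under each convention (lowest address first).
little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")
print(little.hex())
print(big.hex())


def register_view(data: bytes, byteorder: str) -> bytes:
    """Load bytes as one integer, then list them most-significant byte first."""
    as_int = int.from_bytes(data, byteorder=byteorder)
    return as_int.to_bytes(len(data), byteorder="big")


# An 8-character string stored at increasing addresses, loaded as a double word.
print(register_view(b"BACKWARD", "little"))
print(register_view(b"BACKWARD", "big"))
```

Loaded as a little endian double word, the first character lands in the least-significant byte, so the string reads in reverse when the register is printed from its most-significant end.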

• In many computers, accesses to objects larger than a byte must be aligned.

• An access to an object of size s bytes at byte address A is aligned if A mod s = 0.

Issues in interpreting memory addresses

• Misalignment causes hardware complications, since memory is typically aligned on a multiple of a word or double-word boundary.

• A misaligned memory access may take multiple aligned memory references.

• In computers that allow misaligned access, programs with aligned access run faster.
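The alignment rule A mod s = 0 and the cost of a misaligned access can be sketched in a few lines. The helper names and the assumed 8-byte memory width are illustrative assumptions, not details from the slides.

```python
def is_aligned(address: int, size: int) -> bool:
    """An access of `size` bytes at `address` is aligned if address mod size = 0."""
    return address % size == 0


def aligned_references(address: int, size: int, width: int = 8) -> int:
    """Number of width-byte aligned memory blocks an access touches.

    `width` models a memory that is organized in aligned 8-byte chunks
    (a hypothetical figure for this sketch).
    """
    first_block = address // width
    last_block = (address + size - 1) // width
    return last_block - first_block + 1


for addr in (8, 6):
    print(addr, is_aligned(addr, 4), aligned_references(addr, 4))
```

A 4-byte access at address 8 stays inside one 8-byte block, while the same access at address 6 straddles two blocks and therefore needs two aligned references in this model.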

Addressing modes

• If an address is given, memory can be accessed.

• Addressing modes specify constants and registers in addition to locations in memory.

• When a memory location is used, the actual memory address specified by the addressing mode is called the effective address.

Categories

Addressing mode     Example instruction     Meaning
Register            Add R4, R3              Reg[R4] ← Reg[R4] + Reg[R3]
Immediate           Add R4, #3              Reg[R4] ← Reg[R4] + 3
Displacement        Add R4, 100(R1)         Reg[R4] ← Reg[R4] + Mem[100 + Reg[R1]]
Register indirect   Add R4, (R1)            Reg[R4] ← Reg[R4] + Mem[Reg[R1]]
Indexed             Add R3, (R1+R2)         Reg[R3] ← Reg[R3] + Mem[Reg[R1] + Reg[R2]]
Direct              Add R1, (1001)          Reg[R1] ← Reg[R1] + Mem[1001]
Memory indirect     Add R1, @(R3)           Reg[R1] ← Reg[R1] + Mem[Mem[Reg[R3]]]
Autoincrement       Add R1, (R2)+           Reg[R1] ← Reg[R1] + Mem[Reg[R2]]; Reg[R2] ← Reg[R2] + d
Autodecrement       Add R1, -(R2)           Reg[R2] ← Reg[R2] - d; Reg[R1] ← Reg[R1] + Mem[Reg[R2]]
Scaled              Add R1, 100(R2)[R3]     Reg[R1] ← Reg[R1] + Mem[100 + Reg[R2] + Reg[R3] * d]

(d is the size of the element being accessed.)
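The effective-address calculation for the memory-referencing modes can be sketched as a single function. This is an illustrative model only: the function `effective_address`, its keyword parameters, and the register and memory contents are assumptions. Register and immediate modes are omitted because they do not reference memory, and the autoincrement and autodecrement modes are omitted because they also modify a register.

```python
def effective_address(mode, regs, mem, *, base=None, index=None, disp=0, d=1):
    """Return the effective memory address for one addressing mode.

    regs maps register names to values, mem maps addresses to values,
    and d is the size of the element being accessed.
    """
    if mode == "displacement":
        return disp + regs[base]
    if mode == "register_indirect":
        return regs[base]
    if mode == "indexed":
        return regs[base] + regs[index]
    if mode == "direct":
        return disp
    if mode == "memory_indirect":
        return mem[regs[base]]  # the register points at a pointer held in memory
    if mode == "scaled":
        return disp + regs[base] + regs[index] * d
    raise ValueError(f"unsupported addressing mode: {mode}")


# Hypothetical machine state.
regs = {"R1": 200, "R2": 8, "R3": 3}
mem = {200: 1000}

print(effective_address("displacement", regs, mem, base="R1", disp=100))
print(effective_address("indexed", regs, mem, base="R1", index="R2"))
print(effective_address("memory_indirect", regs, mem, base="R1"))
print(effective_address("scaled", regs, mem, base="R2", index="R3", disp=100, d=4))
```

Each richer mode folds address arithmetic into the instruction itself, which would otherwise take separate add or load instructions.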

Usage of different addressing modes

• Register: when a value is in a register.

• Immediate: for constants.

• Displacement: accessing local variables.

• Register indirect: accessing using a pointer or an address.

• Indexed: array addressing (R1 = base of array and R2 = index amount).

• Direct: sometimes useful for accessing static data.

• Memory indirect: if R3 is the address of a pointer p, then the mode yields *p.

• Autoincrement: stepping through arrays within a loop. R2 points to the start of an array; each reference increments R2 by the size of an element, d.

Operations in the Instruction Set

Operator type            Examples
Arithmetic and logical   add, subtract, and, or
Data transfer            loads, stores
Control                  branch, jump, procedure call
System                   OS call, virtual memory management instructions
Floating point           add, multiply, divide, compare
Decimal                  add, multiply, decimal-to-character conversion
String                   move, search, compare
Graphics                 pixel and vertex operations, compression, decompression

Instructions for control flow

• There are four different types of control flow change:
– Conditional branches
– Jumps
– Procedure calls
– Procedure returns

Encoding an instruction set

• Different factors affect how the instructions are encoded into a binary representation.

• The representation affects the size of the compiled program and the implementation of the processor, which must decode the representation quickly to find the operation and its operands.

• The operation is specified in one field, called the opcode.

• The important decision is how to encode the addressing modes with the operations.

• This depends on the range of addressing modes.

• Some older computers have one to five operands with 10 addressing modes for each operand.

• For such a large number of combinations, a separate address specifier is needed for each operand.

• When encoding an instruction, the number of registers and the number of addressing modes both have an impact on the size of the instruction.

Competing forces – instruction encoding

1. Desire to have as many registers and addressing modes as possible.

2. Impact of the size of the register and addressing mode fields on the average instruction size (and hence on the average program size).

3. Desire to have instructions encoded into lengths that will be easy to handle in a pipelined implementation.

• Three choices for encoding an instruction set are:
– Fixed: combines the operation and the addressing mode into the opcode. There is only a single size for all instructions.
– Variable: allows all addressing modes to be used with all operations. This style is best when there are many addressing modes and operations.
– Hybrid: has multiple formats.
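A fixed encoding can be sketched by packing fields into a single 32-bit word. The layout below follows the MIPS-style R-type format (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code), used here as an assumed example rather than something specified in the slides. The helper names are hypothetical.

```python
# Field widths in bits, most-significant field first (MIPS-like R-type layout).
R_TYPE_FIELDS = [("opcode", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6)]


def encode_r_type(**values: int) -> int:
    """Pack the named fields into one fixed-size 32-bit instruction word."""
    word = 0
    for name, width in R_TYPE_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_r_type(word: int) -> dict:
    """Split a 32-bit word back into its fields, least-significant field first."""
    fields = {}
    for name, width in reversed(R_TYPE_FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


# ADD R3, R1, R2 with opcode 0 and function code 0x20 (values assumed for illustration).
word = encode_r_type(opcode=0, rs=1, rt=2, rd=3, shamt=0, funct=0x20)
print(f"{word:#010x}")
print(decode_r_type(word))
```

The 5-bit register fields are what limit this format to 32 registers. Widening them to allow more registers would leave fewer bits for the opcode or force a longer instruction, which is the tension described in the competing forces above.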

Thank You
