Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
Precourse in Computer Architecture Fundamentals
2. Firmware Level


Firmware level

• At this level, the system is viewed as a collection of cooperating modules called PROCESSING UNITS (simply: units).

• Each Processing Unit is

– autonomous

• has its own control, i.e. self-control capability: it is an active computational entity

– described by a sequential program, called microprogram.

• Cooperation is realized through COMMUNICATIONS

– Communication channels implemented by physical links and a firmware protocol.

• Parallelism among units.

MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms 2

[Figure: units U1, U2, …, Uj, …, Un cooperating through communication links; each unit Uj is internally composed of a Control Part j and an Operating Part j.]


Examples


[Figure: a uniprocessor architecture — CPU (containing the Processor, the Memory Management Unit, the Cache Memory, and the Interrupt Arbiter), the Main Memory, and I/O units I/O1 … I/Ok.]

In turn, the CPU is a collection of communicating units (on the same chip).

In turn, the Cache Memory can be implemented as a collection of parallel units: instruction and data cache; primary, secondary, tertiary, … cache.

In turn, the Main Memory can be implemented as a collection of parallel units: modular memory, interleaved memory, buffer memory, or …

In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units.

In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD coprocessor / GPU / …), or a powerful multiprocessor-based network interface, or …


Firmware interpretation of assembly language


[Figure: system levels — Applications, Processes, Assembler, Firmware, Hardware — and architecture classes: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …).]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).

• Physical resources: combinatorial and sequential circuit components, physical links


The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems.

The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, M, I/O subsystems), each unit having its own Control Part and its own microprogram.

Cooperation of units is realized through explicit message passing.


Processing Unit model: Control + Operation


PC and PO are automata, implemented by synchronous sequential networks (sequential circuits).

Clock cycle: time interval t for executing a (any) microinstruction. Synchronous model of computation.

[Figure: Control Part (PC) and Operating Part (PO), with external inputs and external outputs; PO sends the condition variables {x} to PC, PC sends the control variables g = {a} {b} to PO; both receive the same clock signal.]


Example of a simple unit design


Specification: OUT = number_of_occurrences(IN, A)

Pseudo-code of the interpreter:

while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
        if (A[J] == B) then C = C + 1;
    OUT = C

[Figure: unit U with a 32-bit input stream IN and a 32-bit output stream OUT; the integer array A[N] is implemented as a register array (RAM/ROM).]
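The specification and pseudo-code can be checked with a small executable model (illustrative Python, modeling the unit's behaviour rather than its firmware):

```python
# Illustrative model of unit U's specification: for each value B taken from
# the input stream, emit the number of occurrences of B in the array A.

def number_of_occurrences(stream, A):
    out = []
    for B in stream:                 # B = IN (one task per stream element)
        C = 0
        for J in range(len(A)):      # for (J = 0; J < N; J++)
            if A[J] == B:            # if (A[J] == B) then C = C + 1
                C += 1
        out.append(C)                # OUT = C
    return out

print(number_of_occurrences([3, 7], [3, 1, 3, 2]))  # [2, 0]
```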


Approach to processing unit design

• PO structure is composed of
– Registers, to implement variables:
  • independent registers,
  • register arrays (ROM/RAM memory)
– Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others



Synchronized registers


when clock do R_output = R_input

[Figure: register R, with input R_input and output R_output = R_present_state; the clock (pulse signal) is gated by the Register_writing enable.]

In PO, the clock signal is AND-ed with a control variable (b), in order to enable register writing explicitly by PC.

Register output and state remain STABLE during a clock cycle.



Example of PO structure

[Figure: PO datapath — 32-bit registers B (loaded from IN), C, J, OUT and the 1-bit register Z; register array A[N] (RAM/ROM, N words, 32 bits/word), addressed by Jm, with output outA; arithmetic-logic combinatorial circuits ALU1 (operations: −, +1) and ALU2 (operation: +1); multiplexers, controlled by variables aC, aJ, aAlu1, aAlu2 from PC, selecting among the registers, outA and the constants 0, KC, KJ, KALU1; write-enable signals bB, bC, bJ, bZ, bOUT from PC; condition variables x0 = J0 and x1 = Z to PC.]


Approach to processing unit design

• PO structure is composed of
– Registers, to implement variables:
  • independent registers,
  • register arrays (ROM/RAM memory)
– Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
– Example: A + B → C
– i.e., the contents of registers A and B are added, and the result is written into register C.




Example of micro-operation

[Figure: the PO data path for the execution of zero (A[J] – B) → Z, with the corresponding values of the control variables.]

Executed in one clock cycle: all the interested combinatorial circuits become stable during the clock cycle.


Approach to processing unit design

• PO structure is composed of
– Registers, to implement variables:
  • independent registers,
  • register arrays (ROM/RAM memory)
– Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
– Example: A + B → C
– i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations.
– Example: A + B → C, D – E → D




Example of parallel micro-operation

[Figure: the PO data path for the execution of J + 1 → J, C + 1 → C, with the corresponding values of the control variables.]

These register-transfer operations are logically independent, and they can actually be executed in parallel if proper PO resources are provided (e.g., two ALUs). Both operations are executed during the same clock cycle.


Approach to processing unit design

• PO structure is composed of
– Registers, to implement variables: independent registers, and register arrays (ROM/RAM memory)
– Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
– i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle, it interacts with PO in order to control the execution of the current microinstruction.



Example of microinstruction

i. if (Asign = 0)
       then A + B → A, J + 1 → J, goto j
       else A – B → B, goto k

j. …

k. …


Reading the microinstruction: i is the current microinstruction label; then a logical condition on one or more condition variables (Asign denotes the most significant bit of register A); then a (parallel) micro-operation; then the label of the next microinstruction.

Control Part (PC) is an automaton, whose state graph corresponds to the structure of the microprogram:

[Figure: state graph with states i, j, k. From state i: Asign = 0 — execute A + B → A, J + 1 → J in PO, and goto j; Asign = 1 — execute A – B → B in PO, and goto k.]
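The behaviour of PC for this microinstruction can be sketched as a state-transition function (illustrative Python; the state labels and the register dictionary are modeling conventions, not part of the slides):

```python
# Illustrative sketch of PC as an automaton: one call = one clock cycle.
# Registers are modeled as a dictionary; Asign is modeled as the sign of A.

def pc_step(state, regs):
    if state == 'i':
        if regs['A'] >= 0:                     # condition variable Asign = 0
            regs['A'] = regs['A'] + regs['B']  # A + B -> A
            regs['J'] = regs['J'] + 1          # J + 1 -> J  (same clock cycle)
            return 'j'                         # goto j
        else:                                  # Asign = 1
            regs['B'] = regs['A'] - regs['B']  # A - B -> B
            return 'k'                         # goto k
    return state

regs = {'A': 5, 'B': 2, 'J': 0}
print(pc_step('i', regs), regs)   # j {'A': 7, 'B': 2, 'J': 1}
```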


Approach to processing unit design

• PO structure is composed of
– Registers, to implement variables: independent registers, and register arrays (ROM/RAM memory)
– Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
– i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle, it interacts with PO in order to control the execution of the current microinstruction.
• Each microinstruction is executed in a clock cycle.




Clock cycle: informal meaning


[Figure: timing of one clock cycle — PC generates the proper values of the control variables; PO executes the micro-operation, e.g. input_C = A + B and, in parallel, input_D = D + 1; at the clock pulse the registers are updated (C = input_C, D = input_D), and the register contents stabilize during the pulse width. Thus A + B → C and D + 1 → D are executed in parallel in the same clock cycle (A, B, C, D: registers of PO).]

Clock frequency f = 1/t; e.g. t = 0.25 nsec, f = 4 GHz.

Clock cycle, t = maximum stabilization delay time for the execution of a (any) microinstruction, i.e. for the stabilization of the inputs of all the PC and PO registers.


Clock cycle: formally


[Figure: PC and PO with stabilization delays TωPC, TσPC, TωPO, TσPO; condition variables {x} from PO to PC, control variables {g} from PC to PO; pulse width δ.]

clock cycle t = TωPO + max (TωPC + TσPO, TσPC) + δ

where:
ωA: output function of automaton A
σA: state transition function of automaton A
TF: stabilization delay of the combinatorial network implementing function F
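A numeric sketch of how the terms of the formula combine (the delay values, in picoseconds, are invented for illustration only):

```python
# Numeric sketch of the clock-cycle formula of the slide; all delay values
# are made-up numbers in picoseconds, chosen only to show the structure.

def clock_cycle(Tw_PO, Tw_PC, Ts_PO, Ts_PC, d):
    # t = T(omega_PO) + max(T(omega_PC) + T(sigma_PO), T(sigma_PC)) + delta
    return Tw_PO + max(Tw_PC + Ts_PO, Ts_PC) + d

t = clock_cycle(Tw_PO=100, Tw_PC=20, Ts_PO=80, Ts_PC=50, d=20)
print(t, "ps")   # 220 ps
```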


Example of a simple unit design


Specification: OUT = number_of_occurrences(IN, A)

Pseudo-code of the interpreter:

while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
        if (A[J] == B) then C = C + 1;
    OUT = C

Microprogram (= interpreter implementation):

0. IN → B, 0 → C, 0 → J, goto 1
1. if (Jmost = 0) then zero (A[Jmodule] – B) → Z, goto 2
       else C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
       else J + 1 → J, C + 1 → C, goto 1

// to be completed with synchronized communications //

[Figure: unit U with 32-bit streams IN and OUT; the integer array A[N] is implemented as a register array (RAM/ROM). Control Part state graph (automaton): states 0, 1, 2; state 0 is executed 1 time per task, states 1 and 2 N times each; from state 1, Jmost = 0 leads to state 2 and Jmost = 1 back to state 0.]

Service time of U: Tcalc = k t = (2 + 2N) t ≈ 2N t
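The microprogram and its cycle count can be checked with a small behavioral model (illustrative Python; Jmost is modeled as the test J < N, and Jmodule as J itself):

```python
# Behavioral sketch of the three-microinstruction microprogram: one service
# of U, counting one clock cycle per executed microinstruction.

def run_task(B, A):
    """Returns (OUT, clock cycles) for one task; expect 2 + 2N cycles."""
    N = len(A)
    C, J, cycles, state = 0, 0, 1, 1      # microinstruction 0 took 1 cycle
    while True:
        if state == 1:
            cycles += 1
            if J < N:                     # Jmost = 0
                Z = 1 if A[J] == B else 0 # zero(A[J] - B) -> Z
                state = 2                 # goto 2
            else:                         # Jmost = 1: C -> OUT, goto 0
                return C, cycles
        else:                             # microinstruction 2
            cycles += 1
            J += 1                        # J + 1 -> J
            if Z == 1:
                C += 1                    # C + 1 -> C (same clock cycle)
            state = 1                     # goto 1

out, k = run_task(3, [3, 1, 3, 2])
print(out, k)   # 2 occurrences, k = 2 + 2N = 10 cycles
```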



PO structure

[Figure: the PO datapath of the example unit — 32-bit registers B (loaded from IN), C, J, OUT and the 1-bit register Z; register array A[N] (RAM/ROM, N words, 32 bits/word), addressed by Jm, with output outA; ALU1 (operations: −, +1) and ALU2 (operation: +1); multiplexers controlled from PC; write-enable signals from PC; condition variables x0 = J0 and x1 = Z to PC.]



PO structure

[Figure: the data path for the execution of J + 1 → J, C + 1 → C, with the corresponding values of the control variables.]



PO structure

[Figure: the data path for the execution of zero (A[J] – B) → Z, with the corresponding values of the control variables.]


Service time, and other performance parameters

• Computations on streams
• Stream: a (possibly infinite) sequence of values of the same type, also called tasks
• Service time: average time interval between the beginning of the processing of two consecutive tasks, i.e. between two consecutive services
• Latency: average duration of the processing of a single task (from input acquisition to result generation): calculation time
– Service time is equal to latency for sequential computations only (a single module, or unit)
• Processing Bandwidth = 1/service_time: average number of processed tasks per second (e.g. images per second, operations per second, instructions per second, bits transmitted per second, …) – also called throughput for generic applications, or performance for CPUs (instructions or floating-point operations per second)
• Completion time, for a stream of finite length m: time interval needed to fully process all the tasks. For large m and parallel computations, often the following approximate relation holds:

completion_time ≈ m · service_time

The equality is rigorous for sequential computations.
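A quick numeric sketch of these parameters for the example unit of the previous slides (all values illustrative):

```python
# Illustrative numbers for the performance parameters of a single sequential
# unit (so service time = latency) processing a finite stream of m tasks.

t = 0.25e-9                          # clock cycle: 0.25 nsec (f = 4 GHz)
N = 100                              # array length of the example unit
service_time = 2 * N * t             # Tcalc = 2Nt
bandwidth = 1 / service_time         # processed tasks per second
m = 10**6                            # stream length
completion_time = m * service_time   # exact, since the unit is sequential

print(bandwidth, completion_time)    # ~2e7 tasks/s, ~0.05 s
```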


Communications at the firmware level


• Dedicated links (one-to-one, unidirectional).
• Shared link: bus (many-to-many, bidirectional). It can support one-to-one, one-to-many, and many-to-one communications.
• In general, for computer structures: parallel links (e.g. one or more words per link).

Valid for: uniprocessor computers, multiprocessors with shared memory, and some multicomputers with distributed memory (MPP).

Not valid for: distributed architectures with standard communication networks, and some kinds of simple I/O.


Basic firmware structure for dedicated links


[Figure: Usource connected to Udestination by a LINK, through an output interface on the source side and an input interface on the destination side.]

Usource::
    while (true) do
        < produce message MSG >;
        < send MSG to Udestination >

Udestination::
    while (true) do
        < receive MSG from Usource >;
        < process MSG >


Ready – acknowledgement protocol


[Figure: unit U1 with output interface OUT, unit U2 with input interface IN, connected by the MSG link (n bits), the RDY line (1 bit) and the ACK line (1 bit).]

< send MSG > ::
    wait ACK;
    write MSG into OUT, signal RDY

< receive MSG > ::
    wait RDY;
    utilize IN, signal ACK, …

Asynchronous communication: asynchrony degree = 1
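The protocol's behaviour (not its circuit implementation) can be sketched with threads and events standing in for the physical lines — an illustrative model of the asynchrony degree of 1:

```python
# Behavioral sketch of the RDY-ACK protocol on a dedicated link: a send
# blocks until the previous message has been acked by the destination.
import threading

class Link:
    def __init__(self):
        self.rdy = threading.Event()
        self.ack = threading.Event()
        self.ack.set()              # destination interface initially free
        self.out = None             # the OUT interface register

    def send(self, msg):            # wait ACK; write MSG into OUT, signal RDY
        self.ack.wait()
        self.ack.clear()
        self.out = msg
        self.rdy.set()

    def receive(self):              # wait RDY; utilize IN, signal ACK
        self.rdy.wait()
        self.rdy.clear()
        msg = self.out
        self.ack.set()
        return msg

link = Link()
received = []

def destination():
    for _ in range(3):
        received.append(link.receive())

th = threading.Thread(target=destination)
th.start()
for v in [10, 20, 30]:              # the source produces a stream of messages
    link.send(v)
th.join()
print(received)                     # [10, 20, 30]
```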


Implementation


[Figure: U1 with output interface OUT, U2 with input interface IN; MSG link (n bits), RDY line (1 bit), ACK line (1 bit); interface indicators RDYOUT, ACKOUT on the sender side and RDYIN, ACKIN on the receiver side, with write controls bRDYOUT, bACKOUT, bRDYIN, bACKIN.]

< send MSG > ::
    wait until ACKOUT;
    MSG → OUT, set RDYOUT, reset ACKOUT

< receive MSG > ::
    wait until RDYIN;
    f( IN, … ) → … , set ACKIN, reset RDYIN

Level-transition interface: it detects a level transition on the input line, provided that the transition occurs when S = Y.

[Figure: the RDY line enters an exor, whose output is RDYIN, together with the clocked elements S and Y (a modulo-2 counter).]

set X: complements the content of X (a level transition is produced onto the output link).
reset X: forces the output of X to zero (by forcing element Y to become equal to S).


Example of a simple unit design with communication


Specification: OUT = number_of_occurrences(IN, A)

Pseudo-code of the interpreter:

while (true) do
    receive (IN, B); C = 0;
    for (J = 0; J < N; J++)
        if (A[J] == B) then C = C + 1;
    send (OUT, C)

[Figure: unit U with 32-bit interfaces IN and OUT and the associated RDYIN, ACKIN, RDYOUT, ACKOUT lines; the integer array A[N] is implemented as a register array (RAM/ROM).]

Microprogram (= interpreter implementation):

0. if (RDYIN = 0) then nop, goto 0
       else reset RDYIN, set ACKIN, IN → B, 0 → C, 0 → J, goto 1
1. case (Jmost ACKOUT = 0 –) do zero (A[Jmodule] – B) → Z, goto 2
       (Jmost ACKOUT = 1 0) do nop, goto 1
       (Jmost ACKOUT = 1 1) do reset ACKOUT, set RDYOUT, C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
       else J + 1 → J, C + 1 → C, goto 1


Impact of communication protocol on performance parameters


[Figure: Control Part state graph (automaton) — states 0, 1, 2; state 0 executed 1 time per task, states 1 and 2 N times each; self-loop on state 0 for RDYIN = 0 and on state 1 for Jmost = 1 and ACKOUT = 0; transitions RDYIN = 1 (0→1), Jmost = 0 (1→2), Jmost = 1 and ACKOUT = 1 (1→0).]

Synchronization implies wait conditions (wait RDYIN, wait ACKOUT), corresponding to arcs with the same source and destination node (stable internal states) in the state graph.

• In the evaluation of the internal calculation time of U, these wait conditions are not to be taken into account.
• That is, we are interested in evaluating the processing capabilities of U independently of the current situation of the partner units: as if the source and destination units were always ready to communicate.
• On the other hand, the communication protocol has "no cost", i.e. no overhead.
• Thus, the calculation time of U is again Tcalc = 2Nt.
• However, communication might have an impact on the service time.


Communication: performance parameters

• Because of communication impact, we distinguish between:
– Ideal service time = internal calculation time
– Effective service time, or simply service time
• Overhead of the RDY-ACK protocol at the firmware level, for dedicated links: null for the calculation time (ideal service time)!
– all the operations required by the RDY-ACK protocol are executed in a single clock cycle, in parallel with the operations of the true computation.
• Communication latency: time needed to complete a communication, i.e. for the ACK reception.
• Service time: considering the effect of communication, it depends on the calculation time and on the communication latency:
– once a send operation has been executed, the sender can execute a new send operation after the communication latency time interval (not before).



Communication latency and impact on service time


[Figure: sender/receiver timing — RDY and ACK signal exchanges; clock cycles for calculation only vs clock cycles for calculation and communication; Ttr is the transmission latency of the link only; Tcom is the communication time NOT overlapped to (i.e. not masked by) the internal calculation time Tcalc.]

Communication Latency: Lcom = 2 (Ttr + t)

Service time: T = Tcalc + Tcom

If Tcalc ≥ Lcom, then Tcom = 0.


“Coarse grain” and “fine grain” computations

• Two possible situations: Tcalc ≥ Lcom or Tcalc < Lcom


Coarse grain: Tcalc ≥ Lcom ⇒ Tcom = 0; service time T = Tcalc.

Fine grain: Tcalc < Lcom ⇒ Tcom > 0; service time T = Lcom.

In both cases: T = Tcalc + Tcom = max (Tcalc, Lcom)
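A minimal sketch of this cost model (times in clock cycles, values illustrative):

```python
# Sketch of the service-time cost model: Tcom is the part of Lcom that is
# not masked by the internal calculation time Tcalc.

def effective_service_time(Tcalc, Lcom):
    Tcom = max(0, Lcom - Tcalc)        # communication not overlapped
    return Tcalc + Tcom                # = max(Tcalc, Lcom)

print(effective_service_time(40, 20))  # coarse grain: T = Tcalc = 40
print(effective_service_time(10, 20))  # fine grain:   T = Lcom  = 20
```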


In the example:

Tcalc = 2Nt

Let Ttr = 9t. Then Lcom = 2 (t + Ttr) = 20t.

For N ≥ 10: Tcalc ≥ Lcom. In this case Tcom = 0 and the service time is T = Tcalc = 2Nt, i.e. a "coarse grain" calculation (with respect to communications). However, for relatively low N (provided that N ≥ 10), it is a "fine grain" computation (e.g. a few tens of clock cycles) that is still able to fully mask communications.

For N < 10: Tcalc < Lcom. Then Tcom > 0 and the service time is T = Lcom = 20t, i.e. a "too fine" grain calculation.



Communication latency and impact on service time


Communication Latency: Lcom = 2 (Ttr + t). Service time: T = Tcalc + Tcom; if Tcalc ≥ Lcom then Tcom = 0, efficiency e ≈ 1, and the processing bandwidth (throughput) is the ideal one.

• The impact of Ttr on the service time T is significant
• e.g. memory access time = response time of the "external" memory to processor requests ("external" = outside the CPU chip)
• Inside the same chip: Ttr ≈ 0,
• e.g. between processor – MMU – cache,
• except for on-chip interconnection structures with "long wires" (e.g. buses, rings, meshes, …)
• With Ttr ≈ 0, very "fine grain" computations are possible without communication penalty on the service time: Lcom = 2t; if Tcalc ≥ 2t then T = Tcalc
• Important application in Instruction Level Parallelism processors and in Interconnection Network switch-units.


Request-reply communication

Example: a simple cooperation between a processor P and a memory unit M (read or write request from P to M, reply from M to P).


[Figure: timing of a P–M request and reply — processor clock cycle t, link transmission latency Ttr, memory clock cycle tM.]

P-M request-reply latency: ta = 2 (t + Ttr) + tM

This is the memory access time, as "seen" by the processor.
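A numeric sketch of the formula, with all times expressed in CPU clock cycles (the values are illustrative):

```python
# Numeric sketch of the P-M request-reply latency ta = 2 (t + Ttr) + tM,
# with all times in CPU clock cycles and illustrative values.

def access_time(t, Ttr, tM):
    return 2 * (t + Ttr) + tM

# e.g. an off-chip link with Ttr = 10 t and a memory cycle tM = 50 t
print(access_time(t=1, Ttr=10, tM=50))  # 72 clock cycles
```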


Integrated hardware technologies

• Integration on silicon semiconductor CHIPS
• Chip area vs clock cycle:
– Each logical HW component takes up a given area on silicon (currently: of the order of fractions of nanometers).
– Increasing the number of components per chip, i.e. increasing the chip area, is positive provided that the components are intensively "packed".
– Relatively long links on chip are the main source of inefficiency.
– Some macro-components are more suitable for intensive packing (notably: memories), others are very critical (e.g. large combinatorial networks realized by AND-OR gates).
– Pin count problem: increasing the number of interfaces increases the chip perimeter linearly, thus the chip area increases quadratically. Many interfaces (e.g. over 10^2–10^3 pins) are motivated only if the area increase is properly exploited by a quadratic increase in intensively packed HW components.
• Technology of CPUs vs technology of memories and links:
– CPU clock cycle: t ≈ 10^0 nsec (frequency: a few GHz)
– Static Main Memory clock cycle: tM ≈ 10^1–10^2 t
– Dynamic Main Memory clock cycle: tM ≈ 10^2–10^4 t
– Inter-chip link transmission latency: Ttr ≈ 10^0–10^2 t
– Thus, ta is one or more orders of magnitude greater than the CPU clock cycle.



The processor-memory gap (Von Neumann bottleneck)

• For every (known) technology, the ratio ta/t is increasing at every technological improvement– the frequency increase of CPU clock is greater than the frequency increase of

memory clock and of links (inverse of unit latency of links).

• This is the central problem of computer architecture(yesterday, today, tomorrow)– SOLUTIONS ARE BASED ON MEMORY HIERACHIES AND ON PARALLELISM

• “Moore law” (the CPU frequency doubles every 1.5 years) hasbeen valid until 2004 (we’ll see why). Memory and linkstechnologies has been “stopped” too.– Multi-core evolution / revolution.

– The Moore law is now applied to the number of cores (CPUs) in multi-coreand many-core chips.

– What about programming ? One main motivation of SPA and SPM courses.



Interconnection structures based on dedicated links: first examples


[Figure: a tree interconnection — a tree of E nodes distributes the input stream to the units U1 … U4, and a tree of C nodes collects the output stream; below, a ring of units U1 U2 U3 U4.]

Trees: interconnection of a single input stream and of a single output stream to a collection of n independent units. Communication Latency = O(lg n).

Rings / Tori: Communication Latency = O(n).



Interconnection structures based on dedicated links: first examples

Meshes (possibly toroidal):

[Figure: a 3×3 mesh of units U.]

Communication Latency = O(n^1/2)
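The three orders of growth quoted for these structures can be compared side by side (illustrative model with unit constant factors):

```python
# Sketch comparing the orders of growth of the communication latency for
# trees, rings and meshes (constant factors set to 1, purely illustrative).
import math

def latency(structure, n):
    if structure == 'tree':
        return math.ceil(math.log2(n))    # O(lg n)
    if structure == 'ring':
        return n                          # O(n)
    if structure == 'mesh':
        return math.ceil(math.sqrt(n))    # O(n^1/2)
    raise ValueError(structure)

for s in ('tree', 'mesh', 'ring'):
    print(s, latency(s, 1024))            # tree 10, mesh 32, ring 1024
```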


Interconnection structures based on dedicated links

• In general, all the most interesting interconnection structures for
– shared memory multiprocessors
– distributed memory multicomputers, clusters, MPP
are "limited degree" networks:
– based on dedicated links
– with a limited number of links per unit (PIN COUNT problem)
• Examples (Course Part 2):
– n-dimension cubes
– n-dimension butterflies
– n-dimension Trees / Fat Trees
– rings and multi-rings
• Currently the "best" commercial networks belong to these classes:
– Infiniband
– Myrinet
– …
• Trend: on-chip optical interconnection networks



(Traditional, old-style) Bus


[Figure: a bus connecting several units, with the sender behaviour and the receiver behaviour.]

Shared, bidirectional link. At most one message at a time can be transmitted.

Arbitration mechanisms must be provided.

Serious problems of bandwidth (contention) for bus-based parallel structures.


Arbiter mechanisms (for bus writing)


• Independent Request


Arbiters

• Daisy Chaining

• Polling (similar)



Decentralized arbiters

• Token ring

• Non-deterministic with collision detection



Firmware: summary


[Figure: system levels — Applications, Processes, Assembler, Firmware, Hardware — and architecture classes: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …).]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.

• Decentralized control of the firmware interpreter
• Synchronous computation model
• Clock cycle
• Parallelism inside microinstructions
• Clear cost model for calculation and for communication
• Efficient communications can be implemented on dedicated links
• Communication – internal calculation overlapping