ELEC 5200-001/6200-001 Computer Architecture and Design
Fall 2014, Lecture 12 (Nov 19): Instruction-Level Parallelism

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849

http://www.eng.auburn.edu/[email protected]


A Computer System

[Figure: block diagram with the labels Processor, Cache, Main memory, Memory-I/O bus, Interrupts, and three I/O controllers serving two disks, graphics output and a network.]


Advanced Architectures – ILP

- Instruction-level parallelism (ILP): multiple instructions fetched and executed simultaneously.
- ILP is used in addition to pipelining.
- Processors with ILP are called multiple-issue processors: multiple instructions are launched in 1 clock cycle. Two ways:
  - MIMD: Multiple Instructions Multiple Data
    - Superpipeline
    - Superscalar (dynamic multiple issue)
    - Very long instruction word, VLIW (static multiple issue)
  - SIMD: Single Instruction Multiple Data
    - Vector processor


Superpipeline and Superscalar

[Figure: timing diagrams of IF ID EX MEM WB stage sequences plotted against system clock cycles 0 to 8, for three organizations.]

- Pipeline: 1 instruction/cycle
- Superpipeline (pipeline clock is twice as fast as the system clock): 2 instructions per cycle
- Superscalar: 2 (or more) instructions/cycle
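The throughput figures above can be checked with a small idealized model. The following is a sketch only, assuming a 5-stage pipeline with no hazards or stalls; the function names and the `width` and `degree` parameters are illustrative and do not come from the slides.

```python
import math

def pipeline_cycles(n, stages=5):
    """System clock cycles for n instructions in a plain pipeline."""
    return stages + n - 1

def superscalar_cycles(n, stages=5, width=2):
    """System clock cycles when `width` instructions enter the pipeline per cycle."""
    return stages + math.ceil(n / width) - 1

def superpipeline_cycles(n, stages=5, degree=2):
    """System clock cycles when the pipeline clock runs `degree` times faster."""
    return (stages + n - 1) / degree

for n in (4, 100):
    print(n, pipeline_cycles(n), superpipeline_cycles(n), superscalar_cycles(n))
```

For long instruction streams both the superpipeline and the two-wide superscalar approach half the cycle count of the plain pipeline, which is the "2 instructions per cycle" figure.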

A Static Two-Issue MIPS Pipeline

- Read two instructions per cycle:
  - An ALU or branch instruction, and
  - A load or store instruction
- Insert one nop if the above pair is not available
- Added hardware (Figure 4.69, page 336):
  - A second instruction memory
  - Additional input/output ports in the register file
  - Additional ALU in the execute stage for address calculation


An Example (Page 337)

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $0, Loop

Static Two-Issue Execution

       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  nop                         lw   $t0, 0($s1)            1
       addi $s1, $s1, -4           nop                         2
       addu $t0, $t0, $s2          nop                         3
       bne  $s1, $0, Loop          sw   $t0, 4($s1)            4

Note the code reordering and the change in the sw offset: addi now runs before sw, so the store address becomes 4($s1).

CPI = 4/5 = 0.8 > 0.5 (ideal)

Loop Unrolling (Index Multiple of 4)

       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  addi $s1, $s1, -16          lw   $t0, 0($s1)            1
       nop                         lw   $t1, 12($s1)           2
       addu $t0, $t0, $s2          lw   $t2, 8($s1)            3
       addu $t1, $t1, $s2          lw   $t3, 4($s1)            4
       addu $t2, $t2, $s2          sw   $t0, 16($s1)           5
       addu $t3, $t3, $s2          sw   $t1, 12($s1)           6
       nop                         sw   $t2, 8($s1)            7
       bne  $s1, $0, Loop          sw   $t3, 4($s1)            8

CPI = 8/14 = 0.57 > 0.5 (ideal)
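The CPI figures for the two schedules follow from counting clock cycles and non-nop instructions. A minimal sketch, assuming each schedule is written as a list of (ALU-or-branch slot, data-transfer slot) pairs with only the opcode names kept; the `cpi` helper is illustrative and not part of the lecture.

```python
def cpi(schedule):
    """Cycles per instruction for a two-issue schedule; nops are not counted."""
    cycles = len(schedule)
    instructions = sum(op != "nop" for pair in schedule for op in pair)
    return cycles / instructions

# One (ALU/branch, data transfer) pair per clock cycle.
two_issue = [("nop", "lw"), ("addi", "nop"), ("addu", "nop"), ("bne", "sw")]
unrolled = [
    ("addi", "lw"), ("nop", "lw"), ("addu", "lw"), ("addu", "lw"),
    ("addu", "sw"), ("addu", "sw"), ("nop", "sw"), ("bne", "sw"),
]

print(cpi(two_issue))            # 4 cycles / 5 instructions
print(round(cpi(unrolled), 2))   # 8 cycles / 14 instructions
```

Unrolling fills more of the issue slots with useful work, which moves the CPI closer to the ideal 0.5 of a two-issue machine.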


VLIW: Very Long Instruction Word

- Static multiple issue, ILP determined by compiler.
- Datapath contains multiple execution units.
- Compiler groups instructions that have no data or resource conflicts for parallel execution.
- Grouped instructions are packed in very long words of a wide instruction memory.
- Speedup benefit of VLIW is highly program dependent.
- J. A. Fisher, “Very Long Instruction Word Architecture and ELI-512,” Proc. 10th Symp. on Computer Architecture, Stockholm, June 1983, pp. 478-490.
- J. A. Fisher, P. Faraboschi and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, Morgan Kaufmann.


Superscalar: Dynamic Scheduling and Out-of-Order Execution

[Figure: an instruction fetch and decode unit (in-order issue) feeds four reservation stations, one per functional unit (integer, integer, floating point, load/store; out-of-order execution), whose results go to a commit unit (in-order commit). A further label reads "Out-of-order issue".]

Out of Order Execution (OOE)

- A procedural programming language sequences instructions.
- Sequencing assumes an order of execution, with no parallelism.
- OOE must preserve correctness of the result.
- Principle: Two instructions can be executed in parallel if they do not have dependences.


RAW Dependence

Read after write (RAW): A dependent instruction reads from a register being written to by another instruction.

Example:

add $s1, $s2, $s3
sub $s2, $s1, $s3

sub has RAW dependence on add (sub reads $s1, which add writes)


WAR Dependence

Write after read (WAR): A dependent instruction writes to a register being read by another instruction.

Example:

add $s1, $s2, $s3
sub $s2, $s1, $s3

sub has WAR dependence on add (sub writes $s2, which add reads)


WAW Dependence

Write after write (WAW): One instruction writes to a register being written to by another instruction.

Example:

add $s2, $s2, $s3
sub $s2, $s1, $s3

sub has WAW dependence on add (both write $s2)
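The three dependence definitions can be checked mechanically. The following is a sketch only, assuming each instruction is reduced to a (destination, source, source) tuple of register names; the `dependences` helper is illustrative and not from the lecture.

```python
def dependences(first, second):
    """Dependences of `second` on `first`, where `first` is earlier in program order.

    Each instruction is a tuple (dest, src1, src2) of register names.
    """
    dest1, *srcs1 = first
    dest2, *srcs2 = second
    found = set()
    if dest1 in srcs2:      # second reads what first writes
        found.add("RAW")
    if dest2 in srcs1:      # second writes what first reads
        found.add("WAR")
    if dest1 == dest2:      # both write the same register
        found.add("WAW")
    return found

add1 = ("$s1", "$s2", "$s3")   # add $s1, $s2, $s3
sub1 = ("$s2", "$s1", "$s3")   # sub $s2, $s1, $s3
add2 = ("$s2", "$s2", "$s3")   # add $s2, $s2, $s3

print(sorted(dependences(add1, sub1)))   # pair used in the RAW and WAR examples
print(sorted(dependences(add2, sub1)))   # pair used in the WAW example
```

The first pair carries both a RAW and a WAR dependence, which is why the same two instructions appear in both examples. The WAW pair also carries a WAR dependence, since add reads $s2 before sub overwrites it.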


Superscalar Instruction Issue

Rules:

- RAW dependence: If any operand is being written, do not issue.
- WAR dependence: If the result register is being read, do not issue.
- WAW dependence: If the result register is being written, do not issue.

Scoreboard: Cycle by cycle record of registers and execution units showing how many instructions are using them.

- Example 1: In-order issue (next 2 slides).
- Example 2: Out-of-order issue (3rd slide).


Dynamic Scheduling

- Consider an example:
  - First with in-order issue
  - Then with out-of-order issue
- Assume:
  - Up to two instructions are fetched in a cycle
  - The instruction register can hold two instructions
  - An instruction is issued in its decode cycle, or must wait until there is no RAW, WAR or WAW dependence
  - An instruction can retire two or three cycles after it is issued


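In-order Issue Scoreboard

Register entries give the number of issued, unretired instructions reading or writing that register. A "-" in the Issued column marks a decoded instruction that could not be issued.

Cycle  #  Decoded        Issued  Retired  Registers being read  Registers being written
1      1  R3 = R0 * R1   1                R0:1 R1:1             R3:1
       2  R4 = R0 + R2   2                R0:2 R1:1 R2:1        R3:1 R4:1
2      3  R5 = R0 + R1   3                R0:3 R1:2 R2:1        R3:1 R4:1 R5:1
       4  R6 = R1 + R4   -                R0:3 R1:2 R2:1        R3:1 R4:1 R5:1
3                                         R0:3 R1:2 R2:1        R3:1 R4:1 R5:1
4                                1        R0:2 R1:1 R2:1        R4:1 R5:1
                                 2        R0:1 R1:1             R5:1
                                 3
5                        4                R1:1 R4:1             R6:1
       5  R7 = R1 * R2   5                R1:2 R2:1 R4:1        R6:1 R7:1
6      6  R1 = R0 - R2   -                R1:2 R2:1 R4:1        R6:1 R7:1
7                                4        R1:1 R2:1             R7:1
8                                5
9                        6                R0:1 R2:1             R1:1
       7  R3 = R3 * R1   -                R0:1 R2:1             R1:1
10                                        R0:1 R2:1             R1:1
11                               6
12                       7                R1:1 R3:1             R3:1
       8  R1 = R4 + R4   -                R1:1 R3:1             R3:1
13                                        R1:1 R3:1             R3:1
14                                        R1:1 R3:1             R3:1
15                               7
16                       8                R4:2                  R1:1
17                                        R4:2                  R1:1
18                               8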
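The issue and retire cycles of the in-order example can be reproduced with a short simulation. This is a sketch under stated assumptions, not the lecture's own method: additions and subtractions are taken to complete 2 cycles after issue and multiplications 3 cycles after issue, at most two instructions issue per cycle, instructions retire in program order at the end of a cycle, and a retired instruction frees its registers from the next cycle on. `Instr`, `PROGRAM` and `simulate_in_order` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dest: str
    srcs: tuple
    latency: int   # cycles from issue to completion (assumed: 2 for +/-, 3 for *)

PROGRAM = [
    Instr("R3", ("R0", "R1"), 3),   # 1: R3 = R0 * R1
    Instr("R4", ("R0", "R2"), 2),   # 2: R4 = R0 + R2
    Instr("R5", ("R0", "R1"), 2),   # 3: R5 = R0 + R1
    Instr("R6", ("R1", "R4"), 2),   # 4: R6 = R1 + R4
    Instr("R7", ("R1", "R2"), 3),   # 5: R7 = R1 * R2
    Instr("R1", ("R0", "R2"), 2),   # 6: R1 = R0 - R2
    Instr("R3", ("R3", "R1"), 3),   # 7: R3 = R3 * R1
    Instr("R1", ("R4", "R4"), 2),   # 8: R1 = R4 + R4
]

def simulate_in_order(program, width=2):
    """Return (issue_cycles, retire_cycles), both in program order."""
    issue, retire = {}, {}
    in_flight = []            # indices issued but not yet retired, in program order
    next_index, cycle = 0, 0
    while len(retire) < len(program):
        cycle += 1
        issued = 0
        # Issue phase: in order, stop at the first instruction with a conflict.
        while next_index < len(program) and issued < width:
            ins = program[next_index]
            reading = {s for j in in_flight for s in program[j].srcs}
            writing = {program[j].dest for j in in_flight}
            raw = any(s in writing for s in ins.srcs)
            war = ins.dest in reading
            waw = ins.dest in writing
            if raw or war or waw:
                break
            issue[next_index] = cycle
            in_flight.append(next_index)
            next_index += 1
            issued += 1
        # Retire phase: oldest first, only once the instruction has completed.
        while in_flight and issue[in_flight[0]] + program[in_flight[0]].latency <= cycle:
            retire[in_flight.pop(0)] = cycle
    order = range(len(program))
    return [issue[i] for i in order], [retire[i] for i in order]

issue_cycles, retire_cycles = simulate_in_order(PROGRAM)
print(issue_cycles)    # issue cycle of instructions 1..8
print(retire_cycles)   # retire cycle of instructions 1..8
```

Under these assumptions the run ends in cycle 18, with instructions 4, 6, 7 and 8 each held back by a dependence on an instruction that has not yet retired.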

Out-of-order scoreboard (Next 2 Slides)

Questions?

RAW dependence: Inst# 4 (R6 = R1 + R4) could not be issued until cycle 5. Should Inst# 5 (R7 = R1 * R2) wait in queue?

Answer: No. Inst# 5 can be issued in cycle 3 as there is no register conflict (out-of-order issue).

WAR dependence: Must the issue of Inst# 6 (R1 = R0 - R2) wait until cycle 9, when all instructions reading R1 have retired?

Answer: No, provided the new result of Inst# 6 does not affect the R1 value being used by previous instructions (register renaming).
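Register renaming can be illustrated with a few lines of code. The following is a sketch, assuming the common scheme in which every destination gets a fresh register name; the lecture's example renames only the two writes to R1. The names `rename`, `P1`, `P2`, ... are illustrative.

```python
def rename(program):
    """Give every destination a fresh name so that only RAW dependences remain.

    `program` is a list of (dest, src1, src2) register-name tuples.
    """
    latest = {}     # architectural register -> name holding its newest value
    renamed = []
    for number, (dest, src1, src2) in enumerate(program, start=1):
        # Sources read the newest version of each register produced so far.
        new_src1 = latest.get(src1, src1)
        new_src2 = latest.get(src2, src2)
        new_dest = f"P{number}"   # fresh name: nothing earlier reads or writes it
        latest[dest] = new_dest
        renamed.append((new_dest, new_src1, new_src2))
    return renamed

program = [
    ("R3", "R0", "R1"), ("R4", "R0", "R2"), ("R5", "R0", "R1"), ("R6", "R1", "R4"),
    ("R7", "R1", "R2"), ("R1", "R0", "R2"), ("R3", "R3", "R1"), ("R1", "R4", "R4"),
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, instruction 6 writes P6 instead of R1, so it no longer conflicts with instructions 4 and 5, which still read the old R1. Instruction 7 reads P6, which keeps its true RAW dependence on instruction 6.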



Out-of-order Issue Scoreboard

S1 and S2 are renaming registers used in place of R1. A "-" in the Issued column marks a decoded instruction that could not be issued.

Cycle  #  Decoded        Issued  Retired
1      1  R3 = R0 * R1   1
       2  R4 = R0 + R2   2
2      3  R5 = R0 + R1   3
       4  R6 = R1 + R4   -
3      5  R7 = R1 * R2   5
       6  S1 = R0 - R2   6
                                 2
4                        4
       7  R3 = R3 * S1   -
       8  S2 = R4 + R4   8
                                 1
                                 3
5                                6
6                        7
                                 4
                                 5
                                 8
7
8
9                                7

[Per-register read and write counts for R0 to R7 are not reproduced here: the extracted columns are ambiguous.]

References

Previous example is from:
A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Prentice-Hall, 2006, pp. 304-309, Section 4.5.3.

Further reading:
D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. & Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.



Power Reduction by Slack Scheduling

- Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:

  add r0, r1, r2;    (A)
  sub r3, r4, r5;    (B)
  and r9, r1, r9;    (C)
  or  r5, r9, r10;   (D)
  xor r2, r10, r11;  (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.


Slack Scheduling Example

[Figure: two timing diagrams placing instructions A, B, C, D and E in time, one for standard scheduling and one for slack scheduling.]


Slack Scheduling

[Figure: block diagram with the labels scheduling logic, slack bit, re-order buffer, and low-power execution units (reduced voltage).]


Superscalar Design of P4 (CISC)

- CISC shell:
  - Processor fetches instructions from memory in the order of the static program.
  - Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations (micro-ops).
- RISC core:
  - Micro-ops are executed out-of-order in a dynamically scheduled pipeline.
  - Processor commits the result of each micro-op execution to the register file in the order of the original program flow.


Superscalars

3 or more instruction issues per clock:

- Intel P6
- AMD K5
- Sun UltraSPARC
- Alpha 21164
- MIPS R10000
- PowerPC 604/620
- HP 8000

References:

D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. Dev., vol. 11, pp. 8-24, January 1967.

T. Agerwala and J. Cocke, “Reduced Instruction Set Processors,” Technical Report RC12434 (#55845), Yorktown Heights, NY: IBM T. J. Watson Research Center, January 1987.


Topics in Computer Architecture

- Instruction set
- Program execution through register transfer
- Computer arithmetic (2’s complement, IEEE 754 floating point standard, addition, multiplication); see Lectures 13-14
- Datapaths (single-cycle, multicycle, pipeline)
- Control (combinational logic, FSM, microcode)
- Pipelining (throughput, hazards, forwarding, stall, branch prediction)
- Memory organization (cache, virtual memory)
- Performance (benchmarks, energy efficiency, Amdahl’s law)
- Advanced architectures (ILP, OOE, superscalar, etc.)
- Not discussed in this course:
  - Multiprocessors
  - Compiler and software techniques: loop unrolling, trace execution, etc.
  - Input and output
  - Power management


One who claims to know much about computer

architecture speaks from ignorance . . . because

a lot is going to happen in the future, which is . . .

http://www.youtube.com/watch?v=xZbKHDPPrrc

Doris Day in Hitchcock’s 1956 Movie

“The Man Who Knew Too Much”
