Fall 2014, Nov 19 . . . ELEC 5200-001/6200-001 Lecture 12
ELEC 5200-001/6200-001
Computer Architecture and Design
Fall 2014
Instruction-Level Parallelism
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/[email protected]
A Computer System
[Figure: a processor with cache, connected over the memory – I/O bus to main memory and to three I/O controllers (two disks, graphics output, network). The I/O controllers signal the processor through interrupts.]
Advanced Architectures – ILP
Instruction-level parallelism (ILP): multiple instructions fetched and executed simultaneously.
ILP is used in addition to pipelining.
Processors with ILP are called multiple-issue processors: multiple instructions are launched in 1 clock cycle. Two ways:
– MIMD: Multiple Instructions Multiple Data
    Superpipeline
    Superscalar – dynamic multiple issue
    Very long instruction word (VLIW) – static multiple issue
– SIMD: Single Instruction Multiple Data
    Vector processor
Superpipeline and Superscalar
[Figure: three timing diagrams, each showing four instructions passing through the stages IF ID EX MEM WB, plotted against system clock cycles 0 to 8.]
Pipeline: 1 instruction/cycle
Superpipeline (pipeline clock is twice as fast as the system clock): 2 instructions per cycle
Superscalar: 2 (or more) instructions/cycle
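The three diagrams can be compared with a small calculation of ideal completion times. This is only a sketch: it assumes a five-stage pipeline with no hazards or stalls, and the function names are made up for illustration.

```python
import math

STAGES = 5  # IF, ID, EX, MEM, WB


def pipeline_cycles(n):
    """Scalar pipeline: one instruction enters per system clock cycle."""
    return STAGES + (n - 1)


def superpipeline_cycles(n, factor=2):
    """Superpipeline: the pipeline clock runs `factor` times faster than the
    system clock, so every stage takes 1/factor of a system cycle."""
    return (STAGES + (n - 1)) / factor


def superscalar_cycles(n, width=2):
    """Superscalar: `width` instructions enter the pipeline together."""
    return STAGES + (math.ceil(n / width) - 1)


if __name__ == "__main__":
    for n in (4, 1000):
        print(n, pipeline_cycles(n), superpipeline_cycles(n), superscalar_cycles(n))
```

For the four instructions drawn in each diagram, the scalar pipeline needs 8 system cycles, which matches the time axis of the figure. For long instruction streams both multiple-issue schemes approach 2 instructions per system cycle.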
A Static Two-Issue MIPS Pipeline
Read two instructions per cycle:
– An ALU or branch instruction, and
– A load or store instruction
Insert one nop if the above pair is not available.
Added hardware (Figure 4.69, page 336):
– A second instruction memory
– Additional input/output ports in register file
– Additional ALU in execute stage for address calculation
An Example (Page 337)
Loop: lw   $t0, 0($s1)        # load array element
      addu $t0, $t0, $s2      # add the scalar in $s2
      sw   $t0, 0($s1)        # store the result back
      addi $s1, $s1, -4       # move pointer to the next element
      bne  $s1, $0, Loop      # repeat until the pointer reaches 0
Static Two-Issue Execution
       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  nop                         lw   $t0, 0($s1)            1
       addi $s1, $s1, -4           nop                         2
       addu $t0, $t0, $s2          nop                         3
       bne  $s1, $0, Loop          sw   $t0, 4($s1)            4
Note code reordering and change in sw argument.
CPI = 4 cycles / 5 instructions = 0.8 > 0.5 (ideal)
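The CPI figure can be reproduced by counting issue slots. A minimal sketch, assuming the schedule is written as one tuple per clock cycle and that nop slots do not count as instructions (the `cpi` helper is hypothetical):

```python
# One row per clock cycle: (ALU or branch slot, data transfer slot).
SCHEDULE = [
    ("nop",                "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",  "nop"),
    ("addu $t0, $t0, $s2", "nop"),
    ("bne $s1, $0, Loop",  "sw $t0, 4($s1)"),
]


def cpi(schedule):
    """Clock cycles per useful instruction; nop slots are not counted."""
    cycles = len(schedule)
    useful = sum(1 for row in schedule for slot in row if slot != "nop")
    return cycles / useful


print(cpi(SCHEDULE))  # 4 cycles / 5 instructions
```

A two-issue machine with every slot filled would reach the ideal CPI of 0.5; the three nop slots here are what keep the loop at 0.8.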
Loop Unrolling (Index Multiple of 4)
       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  addi $s1, $s1, -16          lw   $t0, 0($s1)            1
       nop                         lw   $t1, 12($s1)           2
       addu $t0, $t0, $s2          lw   $t2, 8($s1)            3
       addu $t1, $t1, $s2          lw   $t3, 4($s1)            4
       addu $t2, $t2, $s2          sw   $t0, 16($s1)           5
       addu $t3, $t3, $s2          sw   $t1, 12($s1)           6
       nop                         sw   $t2, 8($s1)            7
       bne  $s1, $0, Loop          sw   $t3, 4($s1)            8
CPI = 8 cycles / 14 instructions = 0.57 > 0.5 (ideal)
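The changed load and store offsets are easy to get wrong, so a quick functional check helps. The sketch below models memory as a Python dict keyed by byte address and ignores 32-bit wraparound. It assumes that the `lw` issued in the same cycle as `addi` still sees the old value of `$s1`. The function names are illustrative only.

```python
def rolled(mem, s1, s2):
    """Original loop: one array element per iteration."""
    while True:
        t0 = mem[s1]            # lw   $t0, 0($s1)
        t0 += s2                # addu $t0, $t0, $s2
        mem[s1] = t0            # sw   $t0, 0($s1)
        s1 -= 4                 # addi $s1, $s1, -4
        if s1 == 0:             # bne  $s1, $0, Loop
            break


def unrolled(mem, s1, s2):
    """Unrolled schedule: four elements per iteration, offsets as scheduled."""
    while True:
        t0 = mem[s1]            # lw $t0, 0($s1), reads $s1 before the addi
        s1 -= 16                # addi $s1, $s1, -16
        t1 = mem[s1 + 12]       # lw $t1, 12($s1)
        t2 = mem[s1 + 8]        # lw $t2, 8($s1)
        t3 = mem[s1 + 4]        # lw $t3, 4($s1)
        mem[s1 + 16] = t0 + s2  # sw $t0, 16($s1)
        mem[s1 + 12] = t1 + s2  # sw $t1, 12($s1)
        mem[s1 + 8] = t2 + s2   # sw $t2, 8($s1)
        mem[s1 + 4] = t3 + s2   # sw $t3, 4($s1)
        if s1 == 0:             # bne $s1, $0, Loop
            break


def make_memory(n):
    """n words at byte addresses 4, 8, ..., 4n (n must be a multiple of 4)."""
    return {4 * i: 10 * i for i in range(1, n + 1)}


a, b = make_memory(8), make_memory(8)
rolled(a, 32, 3)
unrolled(b, 32, 3)
print(a == b)
```

Both versions leave memory in the same state, which is the property the unrolled schedule has to preserve.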
VLIW: Very Long Instruction Word
Static multiple issue, ILP determined by compiler.
Datapath contains multiple execution units.
Compiler groups instructions that have no data or resource conflicts for parallel execution.
Grouped instructions are packed in very long words of a wide instruction memory.
Speedup benefit of VLIW is highly program dependent.
J. A. Fisher, “Very Long Instruction Word Architecture and ELI-512,” Proc. 10th Symp. on Computer Architecture, Stockholm, June 1983, pp. 478-490.
J. A. Fisher, P. Faraboschi and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, Morgan Kaufmann.
Superscalar: Dynamic Scheduling and Out-of-Order Execution
[Figure: an instruction fetch and decode unit feeds four reservation stations, one per functional unit (integer, integer, floating point, load/store); the functional units deliver results to a commit unit. Labels on the figure: in-order issue, out-of-order execution, in-order commit, out-of-order issue.]
Out of Order Execution (OOE)
A procedural programming language sequences instructions.
Sequencing assumes an order of execution – no parallelism.
OOE must preserve correctness of result.
Principle: Two instructions can be executed in parallel if they do not have dependences.
RAW Dependence
Read after write (RAW): A dependent instruction reads from a register being written to by another instruction.
Example:
add $s1, $s2, $s3
sub $s2, $s1, $s3
sub has RAW dependence on add
WAR Dependence
Write after read (WAR): A dependent instruction writes to a register being read by another instruction.
Example:
add $s1, $s2, $s3
sub $s2, $s1, $s3
sub has WAR dependence on add
WAW Dependence
Write after write (WAW): One instruction writes to a register being written to by another instruction.
Example:
add $s2, $s2, $s3
sub $s2, $s1, $s3
sub has WAW dependence on add
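The three definitions reduce to simple comparisons of destination and source registers. A minimal sketch, using a hypothetical `Instr` record that is not part of the lecture:

```python
from collections import namedtuple

# op: mnemonic, dest: register written, srcs: registers read
Instr = namedtuple("Instr", "op dest srcs")


def dependences(first, second):
    """Dependences that `second` has on the earlier instruction `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second writes what first reads
    if second.dest == first.dest:
        found.add("WAW")   # both write the same register
    return found


# Example used for RAW and WAR
add1 = Instr("add", "$s1", ("$s2", "$s3"))
sub1 = Instr("sub", "$s2", ("$s1", "$s3"))
print(sorted(dependences(add1, sub1)))  # ['RAW', 'WAR']

# Example used for WAW
add2 = Instr("add", "$s2", ("$s2", "$s3"))
sub2 = Instr("sub", "$s2", ("$s1", "$s3"))
print(sorted(dependences(add2, sub2)))  # ['WAR', 'WAW']
```

A single instruction pair can carry more than one dependence, which is why the same add/sub pair serves as the example for both RAW and WAR.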
Superscalar Instruction Issue
Rules:
– RAW dependence: if any operand is being written, do not issue.
– WAR dependence: if the result register is being read, do not issue.
– WAW dependence: if the result register is being written, do not issue.
Scoreboard: cycle-by-cycle record of registers and execution units showing how many instructions are using them.
Example 1: In-order issue (next 2 slides).
Example 2: Out-of-order issue (3rd slide).
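The three issue rules can be expressed as checks against per-register counters. The class below is a sketch under simplifying assumptions: it tracks registers only (no execution units), and its name and methods are illustrative rather than taken from the lecture.

```python
class Scoreboard:
    """Counts how many issued, unretired instructions read or write each register."""

    def __init__(self, num_regs=8):
        self.reading = [0] * num_regs
        self.writing = [0] * num_regs

    def can_issue(self, dest, srcs):
        raw = any(self.writing[r] > 0 for r in srcs)  # an operand is being written
        war = self.reading[dest] > 0                  # result register is being read
        waw = self.writing[dest] > 0                  # result register is being written
        return not (raw or war or waw)

    def issue(self, dest, srcs):
        for r in srcs:
            self.reading[r] += 1
        self.writing[dest] += 1

    def retire(self, dest, srcs):
        for r in srcs:
            self.reading[r] -= 1
        self.writing[dest] -= 1


sb = Scoreboard()
sb.issue(3, (0, 1))               # R3 = R0 * R1
sb.issue(4, (0, 2))               # R4 = R0 + R2
print(sb.can_issue(5, (0, 1)))    # R5 = R0 + R1: True
sb.issue(5, (0, 1))
print(sb.can_issue(6, (1, 4)))    # R6 = R1 + R4: False, RAW on R4
sb.retire(4, (0, 2))
print(sb.can_issue(6, (1, 4)))    # True once R4 = R0 + R2 has retired
```

The register numbers follow the first four instructions of the example that comes next: the fourth instruction is held back until the instruction writing R4 retires.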
Dynamic Scheduling
Consider an example:
– First with in-order issue
– Then with out-of-order issue
Assume:
– Up to two instructions are fetched in a cycle
– Instruction register can hold two instructions
– An instruction is issued in decode cycle, or must wait until there is no RAW, WAR or WAW dependence
– An instruction can retire two or three cycles after it is issued
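In-order issue scoreboard
(In the register columns, "." means no instruction is using the register; "-" in the Issue column means the instruction could not be issued.)
Ck     Inst                  Issue  Retire  Reg. to read     Reg. to write
cycle  #     Decoded         inst#  inst#   0 1 2 3 4 5 6 7  0 1 2 3 4 5 6 7
1      1     R3 = R0 * R1    1              1 1 . . . . . .  . . . 1 . . . .
       2     R4 = R0 + R2    2              2 1 1 . . . . .  . . . 1 1 . . .
2      3     R5 = R0 + R1    3              3 2 1 . . . . .  . . . 1 1 1 . .
       4     R6 = R1 + R4    -              3 2 1 . . . . .  . . . 1 1 1 . .
3                                           3 2 1 . . . . .  . . . 1 1 1 . .
4                                   1       2 1 1 . . . . .  . . . . 1 1 . .
                                    2       1 1 . . . . . .  . . . . . 1 . .
                                    3       . . . . . . . .  . . . . . . . .
5                            4              . 1 . . 1 . . .  . . . . . . 1 .
       5     R7 = R1 * R2    5              . 2 1 . 1 . . .  . . . . . . 1 1
6      6     R1 = R0 - R2    -              . 2 1 . 1 . . .  . . . . . . 1 1
7                                   4       . 1 1 . . . . .  . . . . . . . 1
8                                   5       . . . . . . . .  . . . . . . . .
9                            6              1 . 1 . . . . .  . 1 . . . . . .
       7     R3 = R3 * R1    -              1 . 1 . . . . .  . 1 . . . . . .
10                                          1 . 1 . . . . .  . 1 . . . . . .
11                                  6       . . . . . . . .  . . . . . . . .
12                           7              . 1 . 1 . . . .  . . . 1 . . . .
       8     R1 = R4 + R4    -              . 1 . 1 . . . .  . . . 1 . . . .
13                                          . 1 . 1 . . . .  . . . 1 . . . .
14                                          . 1 . 1 . . . .  . . . 1 . . . .
15                                  7       . . . . . . . .  . . . . . . . .
16                           8              . . . . 2 . . .  . 1 . . . . . .
17                                          . . . . 2 . . .  . 1 . . . . . .
18                                  8       . . . . . . . .  . . . . . . . .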
Out-of-order scoreboard (next 2 slides)
Questions?
RAW dependence: Inst# 4 (R6 = R1 + R4) could not be issued until cycle 5. Should Inst# 5 (R7 = R1 * R2) wait in queue?
Answer: No. Inst# 5 can be issued in cycle 3 as there is no register conflict (out-of-order issue).
WAR dependence: Must the issue of Inst# 6 (R1 = R0 - R2) wait until cycle 9, when all instructions reading R1 have retired?
Answer: No, provided the new result of Inst# 6 does not affect the R1 being used by previous instructions (register renaming).
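Register renaming can be sketched in a few lines. The version below is a common simplification and not the exact scheme of the example: it gives every result a fresh register (hypothetical names P0, P1, ...), whereas the out-of-order scoreboard in this lecture introduces scratch registers S1 and S2 only for the two writes to R1. The helper names are assumptions.

```python
def rename(program):
    """program: list of (dest, src1, src2) register names, in program order.
    Every destination gets a fresh name; later reads follow the newest mapping."""
    mapping = {}                      # architectural name -> current renamed register
    renamed = []
    for index, (dest, a, b) in enumerate(program):
        a = mapping.get(a, a)         # read the newest version of each source
        b = mapping.get(b, b)
        new_dest = f"P{index}"
        mapping[dest] = new_dest
        renamed.append((new_dest, a, b))
    return renamed


def false_dependences(program):
    """Count WAR and WAW pairs (earlier i, later j) in a program."""
    count = 0
    for j, (dest_j, _, _) in enumerate(program):
        for dest_i, a_i, b_i in program[:j]:
            if dest_j in (a_i, b_i):  # WAR: j overwrites a register i reads
                count += 1
            if dest_j == dest_i:      # WAW: j overwrites a register i writes
                count += 1
    return count


PROGRAM = [
    ("R3", "R0", "R1"),   # 1: R3 = R0 * R1
    ("R4", "R0", "R2"),   # 2: R4 = R0 + R2
    ("R5", "R0", "R1"),   # 3: R5 = R0 + R1
    ("R6", "R1", "R4"),   # 4: R6 = R1 + R4
    ("R7", "R1", "R2"),   # 5: R7 = R1 * R2
    ("R1", "R0", "R2"),   # 6: R1 = R0 - R2
    ("R3", "R3", "R1"),   # 7: R3 = R3 * R1
    ("R1", "R4", "R4"),   # 8: R1 = R4 + R4
]

print(false_dependences(PROGRAM), false_dependences(rename(PROGRAM)))
```

After renaming, only the true (RAW) dependences remain: instruction 7, for example, still reads the results of instructions 1 and 6, now under their new names.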
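Out-of-order issue scoreboard (with register renaming)
(In the register columns, "." means no instruction is using the register; "-" in the Issue column means the instruction could not be issued.)
Ck     Inst                  Issue  Retire  Reg. to read     Reg. to write
cycle  #     Decoded         inst#  inst#   0 1 2 3 4 5 6 7  0 1 2 3 4 5 6 7
1      1     R3 = R0 * R1    1              1 1 . . . . . .  . . . 1 . . . .
       2     R4 = R0 + R2    2              2 1 1 . . . . .  . . . 1 1 . . .
2      3     R5 = R0 + R1    3              3 2 1 . . . . .  . . . 1 1 1 . .
       4     R6 = R1 + R4    -              3 2 1 . . . . .  . . . 1 1 1 . .
3      5     R7 = R1 * R2    5              3 3 2 . . . . .  . . . 1 1 1 . 1
       6     S1 = R0 - R2    6              4 3 3 . . . . .  . . . 1 1 1 . 1
                                    2       3 3 2 . . . . .  . . . 1 . 1 . 1
4                            4              3 4 2 . 1 . . .  . . . 1 . 1 1 1
       7     R3 = R3 * S1    -              3 4 2 . 1 . . .  . . . 1 . 1 1 1
       8     S2 = R4 + R4    8              3 4 2 . 3 . . .  . . . 1 . 1 1 1
                                    1       2 3 2 . 3 . . .  . . . . . 1 1 1
                                    3       1 2 2 . 3 . . .  . . . . . . 1 1
5                                   6       . 2 1 . 3 . . .  . 1 . . . . 1 1
6                            7              . 2 1 1 3 . . .  . 1 . 1 . . 1 1
                                    4       . 1 1 1 2 . . .  . 1 . 1 . . . 1
                                    5       . . . 1 2 . . .  . 1 . 1 . . . .
                                    8       . . . 1 . . . .  . . . 1 . . . .
7                                           . . . 1 . . . .  . . . 1 . . . .
8                                           . . . 1 . . . .  . . . 1 . . . .
9                                   7       . . . . . . . .  . . . . . . . .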
References
Previous example is from:
A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Prentice-Hall, 2006, pp. 304-309, Section 4.5.3.
Further reading:
D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. & Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.
Power Reduction by Slack Scheduling
Application: Superscalar, out-of-order execution:
– An instruction is executed as soon as the required data and resources become available.
– A commit unit reorders the results.
Delay the completion of instructions whose result is not immediately needed.
Example of RISC instructions:
add r0, r1, r2; (A)
sub r3, r4, r5; (B)
and r9, r1, r9; (C)
or  r5, r9, r10; (D)
xor r2, r10, r11; (E)
J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
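One way to see which of these instructions have slack is to list, for each result, the later instructions that read it. This is only an illustration of the idea within the five-instruction window, not the mechanism of the cited paper. The first operand is assumed to be the destination, and `consumers` is a hypothetical helper.

```python
# label -> (destination register, source registers), in program order
WINDOW = {
    "A": ("r0", ("r1", "r2")),    # add r0, r1, r2
    "B": ("r3", ("r4", "r5")),    # sub r3, r4, r5
    "C": ("r9", ("r1", "r9")),    # and r9, r1, r9
    "D": ("r5", ("r9", "r10")),   # or  r5, r9, r10
    "E": ("r2", ("r10", "r11")),  # xor r2, r10, r11
}


def consumers(window):
    """Map each instruction to the later instructions that read its result."""
    names = list(window)
    result = {name: [] for name in names}
    for i, producer in enumerate(names):
        dest = window[producer][0]
        for later in names[i + 1:]:
            later_dest, later_srcs = window[later]
            if dest in later_srcs:
                result[producer].append(later)
            if later_dest == dest:
                break                 # value overwritten; no further readers
    return result


uses = consumers(WINDOW)
print(uses)
print([name for name, readers in uses.items() if not readers])
```

Within this window only C feeds a later instruction (D reads r9), so the other results are not immediately needed and are candidates for delayed, lower-power execution. A real scheduler would look beyond five instructions.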
Slack Scheduling Example
[Figure: two schedules of instructions A, B, C, D and E, one labeled "Slack scheduling" and one labeled "Standard scheduling".]
Slack Scheduling
[Figure: block diagram with scheduling logic, a re-order buffer, a slack bit, and low-power execution units (reduced voltage).]
Superscalar Design of P4 (CISC)
CISC shell:
– Processor fetches instructions from memory in the order of the static program.
– Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations (micro-ops).
RISC core:
– Micro-ops are executed out-of-order in a dynamically scheduled pipeline.
– Processor commits the result of each micro-op execution to the register file in the order of the original program flow.
Superscalars
3 or more instruction issues per clock:
– Intel P6
– AMD K5
– Sun UltraSPARC
– Alpha 21164
– MIPS R10000
– PowerPC 604/620
– HP 8000
References:
D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. Dev., vol. 11, pp. 8-24, January 1967.
T. Agerwala and J. Cocke, “Reduced Instruction Set Processors,” Technical Report RC12434 (#55845), Yorktown Heights, NY: IBM T. J. Watson Research Center, January 1987.
Topics in Computer Architecture
Instruction set
Program execution through register transfer
See Lectures 13-14.
Computer arithmetic (2’s complement, IEEE 754 floating point standard, addition, multiplication)
Datapaths (single-cycle, multicycle, pipeline)
Control (combinational logic, FSM, microcode)
Pipelining (throughput, hazards, forwarding, stall, branch prediction)
Memory organization (cache, virtual memory)
Performance (benchmarks, energy efficiency, Amdahl’s law)
Advanced architectures (ILP, OOE, superscalar, etc.)
Not discussed in this course:
– Multiprocessors
– Compiler and software techniques – loop unrolling, trace execution, etc.
– Input and output
– Power management
One who claims to know much about computer
architecture speaks from ignorance . . . because
a lot is going to happen in the future, which is . . .
http://www.youtube.com/watch?v=xZbKHDPPrrc
Doris Day in Hitchcock’s 1956 Movie
“The Man Who Knew Too Much”