Evolution of the ILP Processing

Dezső Sima

Fall 2008

(Ver. 2.0) Dezső Sima, 2008

Foreword

The steady demand for higher processor performance has provoked the successive introduction of temporal, issue and intra-instruction parallelism into processor operation. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark subsequent evolutionary phases of microprocessors.

On the other hand, the introduction of each basic technique mentioned gave rise to specific system bottlenecks whose resolution called for innovative new techniques. Thus, the emergence of pipelined instruction processing stimulated the introduction of caches and of speculative branch processing. The debut of superscalar instruction issue gave rise to more advanced memory subsystems and to more advanced branch processing. The desire to further increase the per cycle performance of first generation superscalars called for avoiding their issue bottleneck by the introduction of shelving, renaming and a concerted enhancement of all relevant subsystems of the microarchitecture. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA and the system architecture.

With the main dimensions of parallelism more or less exhausted in the second generation superscalars for general purpose applications, increasing the clock frequency remained the single major possibility to increase performance further. The rapid increase of clock frequencies, however, led to limits of evolution, as discussed in Chapter II.

Structure

1. Paradigms of ILP-processing

2. Introduction of temporal parallelism

3. Introduction of issue parallelism

3.2. VLIW processing

3.3. Superscalar processing

4. Introduction of data parallelism

5. The main road of evolution

6. Outlook

1. Paradigms of ILP-processing

1.1. Introduction (1)

Figure 1.1: Evolution of computer classes (timeline 1950-2000). Classes shown: supercomputer, mainframe, minicomputer and microcomputer, later also server/workstation, desktop PC and value PC. Machines marked: ENIAC, UNIVAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E; /360, /370, /390, z/900; PDP-8, PDP-11, VAX, RS/6000; Altair, 4004, 8080, 8088, 80286, 80386, 80486, Pentium, PPro, PII, PIII, P4, Celeron, Xeon.

Figure 1.2: The integer performance of Intel's x86 line of processors (SPECint92 on a logarithmic scale from 0.2 to 10000, plotted over the years 1979-2005). Data points run from the 8088/5 through the 80286, 386, 486, Pentium, Pentium Pro, PII and PIII models up to the P4/3200 (Northwood B) and the Prescott (1M) and Prescott (2M) models. The trend line corresponds to a growth of about 100x per 10 years, with a leveling off at the last data points.
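The trend of about 100x per 10 years quoted in Figure 1.2 can be converted into a yearly rate. The short sketch below assumes a constant exponential growth over the whole decade, which is a simplification of the plotted curve; the function names are illustrative only.

```python
import math

# Growth factor read off the trend line of Figure 1.2: ~100x per 10 years.
FACTOR_PER_DECADE = 100.0
YEARS = 10.0


def annual_growth_factor(factor: float, years: float) -> float:
    """Constant yearly factor g such that g ** years == factor."""
    return factor ** (1.0 / years)


def doubling_time_years(factor: float, years: float) -> float:
    """Years needed to double performance at the same constant rate."""
    return years * math.log(2.0) / math.log(factor)


g = annual_growth_factor(FACTOR_PER_DECADE, YEARS)
t2 = doubling_time_years(FACTOR_PER_DECADE, YEARS)
print(f"annual growth factor: {g:.3f}")
print(f"doubling time: {t2:.2f} years")
```

Under this assumption the integer performance grows by somewhat less than 60 % per year, that is, it doubles roughly every year and a half.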


1.2. Paradigms of ILP-processing (1)

Paradigms of ILP-processing:

• Temporal parallelism: pipeline processors
• Issue parallelism
  - Static dependency resolution: VLIW processors
  - Dynamic dependency resolution: superscalar processors
• Data parallelism: SIMD extension

VLIW processing (VLIW: Very Long Instruction Word): the processor receives independent instructions (static dependency resolution) and passes them to its parallel units (labeled FE in the figure).

Superscalar processing: the processor receives dependent instructions and performs dynamic dependency resolution before passing them to its parallel units (FE).


1.2. Paradigms of ILP-processing (2)

Figure 1.3: The emergence of ILP-paradigms and processor types

Sequential processing
→ Temporal parallelism (~ '85): pipeline processors
→ Issue parallelism (~ '90): VLIW processors (static dependency resolution), superscalar processors
→ Data parallelism (~ '95-'00): EPIC processors, superscalar processors with SIMD extension


1.3. Performance potential of ILP-processors (1)

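Table columns: absolute performance, ideal case, real case.

Sequential:        $P_{ai} = f_c \cdot \frac{1}{CPI}$

Pipeline:          $P_{ai} = f_c \cdot \frac{1}{CPI}$

VLIW/superscalar:  $P_{ai} = f_c \cdot \frac{1}{CPI} \cdot IP$

SIMD extension:    $P_{ao} = f_c \cdot \frac{1}{CPI} \cdot IP \cdot OPI$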

1.3. Performance potential of ILP-processors (2)

Performance components of ILP-processors:

$P_{ao} = f_c \cdot \frac{1}{CPI} \cdot IP \cdot OPI \cdot \eta$

• $f_c$: clock frequency (depends on technology/μarchitecture)
• $\frac{1}{CPI}$: temporal parallelism
• $IP$: issue parallelism
• $OPI$: data parallelism
• $\eta$: efficiency of speculative execution

The per cycle efficiency depends on ISA, μarchitecture, system architecture, OS, compiler and application:

$P_{ao} = f_c \cdot IPC_{eff}$, with $IPC_{eff} = \frac{1}{CPI} \cdot IP \cdot OPI \cdot \eta$
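The performance components of ILP-processors can be tried out numerically. The following is a minimal sketch that multiplies the clock frequency with the temporal, issue and data parallelism terms and with the efficiency of speculative execution (written `eta` here). The class name, the field names and the sample values are assumptions made for illustration; they are not measurements of any processor.

```python
from dataclasses import dataclass


@dataclass
class IlpProcessor:
    f_c: float        # clock frequency in Hz
    cpi: float        # temporal parallelism enters as 1 / CPI
    ip: float         # issue parallelism term (IP)
    opi: float        # data parallelism term (operations per instruction)
    eta: float = 1.0  # efficiency of speculative execution (assumed symbol)

    def ipc_eff(self) -> float:
        """Per cycle efficiency: (1 / CPI) * IP * OPI * eta."""
        return (1.0 / self.cpi) * self.ip * self.opi * self.eta

    def p_ao(self) -> float:
        """Absolute performance in operations per second: f_c * IPC_eff."""
        return self.f_c * self.ipc_eff()


# Hypothetical values: 1 GHz clock, CPI = 1, two instructions issued per
# cycle, 1.5 operations per instruction, 90 % speculation efficiency.
example = IlpProcessor(f_c=1e9, cpi=1.0, ip=2.0, opi=1.5, eta=0.9)
print(f"IPC_eff = {example.ipc_eff():.2f}")
print(f"P_ao    = {example.p_ao():.3e} operations/s")
```

Keeping the factors separate makes it easy to see which dimension of parallelism a given design change affects.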

2. Introduction of temporal parallelism

Types of temporal parallelism in ILP processors
(F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle)

• Sequential processing: the F, D, E, W cycles of instruction i+1 start after those of instruction i have finished.
  Mainframes: early mainframes
• Overlapping the fetch and further phases (prefetching)
  Mainframes: Stretch (1961) [34]
  Microprocessors: i80286 (1982) [39], M68020 (1985) [40]
• Overlapping the execute phases through pipelining (pipelined EUs with stages E1, E2, E3)
  Mainframes: IBM 360/91 (1967) [35], CDC 7600 (1969) [36]
• Overlapping all phases (pipeline processors)
  Mainframes: Atlas (1963) [37], IBM 360/91 (1967) [38]
  Microprocessors: R2000 (1988) [41], i80386 (1985) [42], M68030 (1988) [43]

Figure 2.1: Implementation alternatives of temporal parallelism
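The gain from overlapping all phases can be illustrated with a small cycle count. The sketch below assumes an ideal pipeline (one cycle per stage, no stalls, no branches), which is a textbook simplification and not a model of any processor named above; the function names are illustrative.

```python
def sequential_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, stages = 100, 4  # four stages as in the F, D, E, W scheme
seq = sequential_cycles(n, stages)
pipe = pipelined_cycles(n, stages)
print(f"sequential: {seq} cycles, pipelined: {pipe} cycles, "
      f"speedup: {seq / pipe:.2f}")
```

For long instruction sequences the speedup approaches the number of stages, which is why the bottlenecks discussed next (memory bandwidth and branches) matter: they are what keeps a real pipeline from reaching this bound.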


Figure 2.2: The appearance of pipeline processors (timeline 1980-1992 for three processor lines; the figure marks which models are pipeline (scalar) processors)

x86: 80286, 80386, 80486
M68000: 68020, 68030, 68040
MIPS R: R2000, R3000, R6000, R4000

2.2. Processing bottlenecks evoked and their resolution

2.2.1. Overview

• The scarcity of memory bandwidth (2.2.2)
• The problem of branch processing (2.2.3)

2.2.2. The scarcity of memory bandwidth (1)

Sequential processing → Pipeline processing: more instructions and data need to be fetched per cycle, which requires larger memory bandwidth.

2.2.2. The scarcity of memory bandwidth (2)

Figure 2.3: Introduction of caches (timeline 1980-1992 for the x86 line: 80286, 80386, 80486; the M68000 line: 68020, 68030, 68040; and the MIPS R line: R2000, R3000, R6000, R4000; the figure distinguishes pipeline (scalar) processors with cache(s) from those without cache(s))

Legend:
C(n): universal cache (size in kB)
C(n/m): instruction/data cache (sizes in kB)

Cache annotations shown in the figure: C(8); C(1/4,1/4), C(4,4); C(4,4), C(4,4), C(16), C(8,8); C(0,1/4).

2.2.3. The problem of branch processing (1)

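(E.g. in case of conditional branches)

Figure 2.4: Processing of a conditional branch on a 4-stage pipeline (stages F, D, E, W plotted over clock cycles). The steps annotated for the conditional branch are decode, branch address calculation and condition checking (branch taken). The sequentially following instructions have already been fetched by the time the condition is checked, and the branch target instruction is fetched only afterwards.

Legend:
bc: conditional branch
bti: branch target instruction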


Figure 2.5: Principle of branch prediction in case of a conditional branch

Legend: conditional branches; instructions other than conditional branches; basic blocks; guessed path; approved path.
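Guessing a path requires some prediction scheme. The slides do not name one at this point, so the sketch below swaps in a two-bit saturating counter, a common textbook scheme for dynamic branch prediction, purely as an illustration. The class name and the sample branch history are assumptions.

```python
from typing import List


class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict 'not taken', 2-3 'taken'."""

    def __init__(self, state: int = 1) -> None:
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of branches whose guessed path turned out to be the approved one."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            hits += 1
        predictor.update(taken)
    return hits / len(outcomes)


# Hypothetical history: a loop-closing branch taken 9 times, then falling
# through once, for three executions of the loop.
history = ([True] * 9 + [False]) * 3
print(f"prediction accuracy: {accuracy(history):.2%}")
```

Each misprediction means the guessed path has to be discarded, so the accuracy directly determines how much of the speculative work is useful.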


Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors (same timeline, processor lines and cache annotations as in Figure 2.3; the figure additionally marks the (scalar) pipeline processors with speculative execution of branches)

2.3. Generations of pipeline processors (1)

                             Cache    Speculative branch processing
1. generation pipelined      no       no
1.5. generation pipelined    yes      no
2. generation pipelined      yes      yes

2.3. Generations of pipeline processors (2)

Figure 2.7: Generations of pipeline processors (same timeline, processor lines and cache annotations as in Figure 2.3)

Legend:
1. generation pipelined (no cache, no speculative branch processing)
1.5. generation pipelined (cache, no speculative branch processing)
2. generation pipelined (cache, speculative branch processing)

2.4. Exhausting the available temporal parallelism

2. generation pipeline processors already exhaust the available temporal parallelism.

3. Introduction of issue parallelism

3.1. Options to implement issue parallelism

Starting from pipeline processing, issue parallelism can be implemented in two ways:

• VLIW (EPIC) instruction issue: static dependency resolution (3.2)
• Superscalar instruction issue: dynamic dependency resolution (3.3)

3.2. VLIW processing (1)

Figure 3.1: Principle of VLIW processing. The memory/cache delivers VLIW instructions with independent sub-instructions (static dependency resolution) to the VLIW processor, which feeds them to its parallel EUs (~ 10-30 EUs).


VLIW: Very Long Instruction Word

Term: 1983 (Fisher)

Length of sub-instructions: ~ 32 bit

Instruction length: ~ n*32 bit

n: number of execution units (EU)

Complex VLIW compiler

Static dependency resolution with parallel optimization


Figure 3.2: Experimental and commercially available VLIW processors

The term ‘VLIW’

Source: Sima et al., ACA, Addison-Wesley, 1997


Benefits of static dependency resolution:

• Less complex processors
• Earlier appearance
• Either higher fc or larger ILP

Drawbacks of static dependency resolution:

• The compiler uses technology dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions
• Completely new ISA: new compilers and OS, rewriting of applications, achieving the critical mass to convince the market

Drawbacks of static dependency resolution (cont.):

• VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized
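The cost of partially filled VLIW instructions can be quantified with a small sketch. It assumes 32-bit sub-instructions, as quoted earlier for VLIW, and a hypothetical four-slot instruction word in which unused slots are padded with no-ops; the program and the operation names are made up for illustration.

```python
from typing import List, Optional

SUB_INSTRUCTION_BITS = 32  # ~32 bit per sub-instruction, as stated for VLIW
NOP = None                 # an empty slot that the compiler could not fill

VliwWord = List[Optional[str]]


def slot_utilization(words: List[VliwWord]) -> float:
    """Fraction of sub-instruction slots that hold a useful operation."""
    total = sum(len(word) for word in words)
    used = sum(1 for word in words for op in word if op is not NOP)
    return used / total


def wasted_bits(words: List[VliwWord]) -> int:
    """Memory occupied (and fetched) for empty slots."""
    empty = sum(1 for word in words for op in word if op is NOP)
    return empty * SUB_INSTRUCTION_BITS


# Hypothetical compiler output for a 4-EU VLIW processor.
program: List[VliwWord] = [
    ["add", "mul", NOP, NOP],
    ["load", NOP, NOP, NOP],
    ["add", "sub", "store", NOP],
]
print(f"slot utilization: {slot_utilization(program):.0%}")
print(f"wasted bits: {wasted_bits(program)}")
```

Every empty slot still occupies memory and fetch bandwidth, which is the drawback stated in the last bullet above.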


Commercial VLIW processors:

• Trace (1987), Multiflow
• Cydra-5 (1989), Cydrome

In a few years both firms became bankrupt.

Developers: to HP, IBM. They became initiators/developers of EPIC processors.


VLIW → EPIC: integration of SIMD instructions and advanced superscalar features

1994: Intel and HP announced their cooperation

1997: The term EPIC was born

2001: IA-64 Itanium

3.3. Superscalar processing

3.3.1. Introduction (1)

Pipeline processing → Superscalar instruction issue

Main attributes of superscalar processing:

Compatible ISA

3.3.1. Introduction (2)

Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

3.3.1. Introduction (3)

Figure 3.4: Emergence of superscalar processors (timeline 1987-1996; the figure marks the superscalar processors, the number in parentheses is the issue rate). Source: Sima et al., ACA, Addison-Wesley, 1997

RISC processors
Line         Scalar predecessor   Superscalar processors
Intel 960    960KA/KB             960CA (3)
M 88000      MC 88100             MC 88110 (2)
HP PA        PA 7000              PA7100 (2)
SPARC        MicroSparc           SuperSparc (3)
Mips R       R 4000               R 8000 (4)
Am 29000     29040                29000 sup (4)
IBM Power                         Power1 (4), RS/6000
DEC                               21064 (2)
PowerPC                           PPC 601 (3), PPC 603 (3)

CISC processors
Line         Scalar predecessor   Superscalar processors
Intel x86    i486                 Pentium (2)
M 68000      M 68040              M 68060 (2)
Gmicro       Gmicro/100p          Gmicro500 (2)
AMD K5                            K5 (4)
CYRIX M1                          M1 (2)

3.3.2. Attributes of first generation superscalars (1)

Width:
• 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"

Core:
• Static branch prediction

Cache:
• Single-ported, blocking L1 data caches
• Off-chip L2 caches attached via the processor bus

Examples:
• Pentium
• PA 7100
• Alpha 21064

Consistency of processor features (1)

Dynamic instruction frequencies in general purpose applications (Wall 1989; Lam, Wilson 1992):

FX instructions ~ 40 %
Load instructions ~ 30 %
Store instructions ~ 10 %
Branches ~ 20 %
FP instructions ~ 1-5 %

Available parallelism in general purpose applications assuming direct issue: ~ 2 instructions/cycle

Reasonable core width: 2 - 3 instructions/cycle

Required number of data cache ports (np):
np ~ 0.4 * (2 - 3) = 0.8 – 1.2 instructions/cycle → single port data caches

Required EUs (each L/S instruction generates an address calculation as well):
FX ~ 0.8 * (2 – 3) = 1.6 – 2.4 → 2 – 3 FX EUs
L/S ~ 0.4 * (2 – 3) = 0.8 – 1.2 → 1 L/S EU
Branch ~ 0.2 * (2 – 3) = 0.4 – 0.6 → 1 B EU
FP ~ (0.01 – 0.05) * (2 – 3) → 1 FP EU
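The estimate above is a simple multiplication of the instruction mix by the core width, and can be reproduced as follows. This is a sketch of the arithmetic only; the dictionary and function names are illustrative, and the FP share is left out because the slide gives it only as a range.

```python
from typing import Dict

# Dynamic instruction frequencies quoted above (general purpose applications).
MIX: Dict[str, float] = {"fx": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20}


def eu_demand(width: float) -> Dict[str, float]:
    """Average number of operations per cycle that each EU type must sustain."""
    load_store = MIX["load"] + MIX["store"]
    return {
        # Each L/S instruction generates an address calculation on an FX EU.
        "FX": (MIX["fx"] + load_store) * width,
        "L/S": load_store * width,
        "Branch": MIX["branch"] * width,
    }


def data_cache_ports(width: float) -> float:
    """Average number of data cache accesses per cycle (np)."""
    return (MIX["load"] + MIX["store"]) * width


for width in (2, 3):  # reasonable core width of first generation superscalars
    demand = {name: round(value, 2) for name, value in eu_demand(width).items()}
    print(f"width {width}: {demand}, np = {data_cache_ports(width):.1f}")
```

Calling the same functions with a larger width shows how the required number of EUs and cache ports scales with the issue rate.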



The issue bottleneck

Figure 3.5: The principle of direct issue

(a) Simplified structure of the microarchitecture assuming direct issue: I-cache → I-buffer → instruction window (3) → decode, check, issue → EUs.

(b) The issue process: in each cycle the executable instructions of the instruction window are issued, whereas dependent instructions block instruction issue. In the example the window holds i1, i2, i3 in cycle Ci, i2 and i3 remain in cycle Ci+1, and i4, i5, i6 follow in cycle Ci+2.

3.3.3. The bottleneck evoked and its resolution (1)

Eliminating the issue bottleneck

Figure 3.6: Principle of the buffered (out of order) issue

Components shown: I-cache, I-buffer, instruction window, decode/dispatch, shelving buffers, dependency checking/issue for each shelving buffer, EUs.

• Dispatch: instructions are dispatched without checking for dependences to the shelving buffers (reservation stations).
• Issue: shelved, not dependent instructions are issued for execution to the EUs.
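The difference between direct issue and buffered issue can be sketched with a toy cycle count. The model below rests on several simplifying assumptions that are not taken from the slides: every operation has a latency of one cycle, the shelving buffers are unbounded, up to `width` instructions are dispatched and issued per cycle, and each instruction may only depend on earlier instructions. Function names and the sample program are hypothetical.

```python
from typing import Dict, List


def direct_issue_cycles(deps: List[List[int]], width: int) -> int:
    """In-order issue: a dependent instruction blocks all instructions behind it."""
    issued_at: Dict[int, int] = {}
    next_i, cycle = 0, 0
    while next_i < len(deps):
        issued = 0
        while next_i < len(deps) and issued < width:
            ready = all(p in issued_at and issued_at[p] < cycle
                        for p in deps[next_i])
            if not ready:
                break  # issue is blocked for the rest of this cycle
            issued_at[next_i] = cycle
            next_i += 1
            issued += 1
        cycle += 1
    return cycle


def shelved_issue_cycles(deps: List[List[int]], width: int) -> int:
    """Buffered issue: dispatch without checks, issue any ready shelved instruction."""
    issued_at: Dict[int, int] = {}
    shelved: List[int] = []
    next_i, cycle = 0, 0
    while len(issued_at) < len(deps):
        # Dispatch up to `width` instructions to the shelving buffers.
        for _ in range(width):
            if next_i < len(deps):
                shelved.append(next_i)
                next_i += 1
        # Dependency checking happens at the shelving buffers.
        ready = [i for i in shelved
                 if all(p in issued_at and issued_at[p] < cycle for p in deps[i])]
        for i in ready[:width]:
            issued_at[i] = cycle
            shelved.remove(i)
        cycle += 1
    return cycle


# Hypothetical program: i1 depends on i0, i2 depends on i1, i3-i5 are independent.
program = [[], [0], [1], [], [], []]
print("direct issue:  ", direct_issue_cycles(program, width=3), "cycles")
print("buffered issue:", shelved_issue_cycles(program, width=3), "cycles")
```

In the buffered variant the independent instructions overtake the dependent chain, which is the effect the shelving buffers are introduced for.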


First generation (narrow) superscalars → Second generation (wide) superscalars:

elimination of the issue bottleneck and, in addition, widening the processing width of all subsystems of the core.

3.3.4. Attributes of second generation superscalars (1)

First generation "narrow" superscalars

Width:
• 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core:
• Static branch prediction
Caches:
• Single-ported, blocking L1 data caches
• Off-chip L2 caches attached via the processor bus
Examples:
• Pentium
• PA 7100
• Alpha 21064

Second generation "wide" superscalars

Width:
• 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core:
• Buffered (ooo) issue
• Predecoding
• Dynamic branch prediction
• Register renaming
• ROB
Caches:
• Dual-ported, non-blocking L1 data caches
• Direct attached off-chip L2 caches
Examples:
• Pentium Pro
• K6
• PA 8000
• Alpha 21264


Dynamic instruction frequencies in general purpose applications (Wall 1990):

FX instructions ~ 40 %
Load instructions ~ 30 %
Store instructions ~ 10 %
Branches ~ 20 %
FP instructions ~ 1-5 %

Available parallelism in general purpose applications assuming buffered issue: ~ 4 – 6 instructions/cycle

Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue. Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

Reasonable core width: 4 - 5 instructions/cycle

Required number of data cache ports (np):
np ~ 0.4 * (4 - 5) = 1.6 – 2 instructions/cycle → dual port data caches

Required EUs (each L/S instruction generates an address calculation as well):
FX ~ 0.8 * (4 – 5) = 3.2 – 4 → 3 – 4 FX EUs
L/S ~ 0.4 * (4 – 5) = 1.6 – 2 → 2 L/S EUs
Branch ~ 0.2 * (4 – 5) = 0.8 – 1 → 1 B EU
FP ~ (0.01 – 0.05) * (4 – 5) → 1 FP EU



3.3.5. Exhausting the issue parallelism

In general purpose applications 2. generation ("wide") superscalars already exhaust the parallelism available at the instruction level.

4. Introduction of data parallelism

4.1. Overview (1)

Figure 4.1: Implementation alternatives of data parallelism

Possible approaches to introduce data parallelism:

• Dual-operation instructions, e.g. i = a*b+c (i: O2 O1)
  OPIn = 2; dedicated use; OPI ~ 1+ (for general use)
• SIMD instructions (ISA extension); dedicated use; OPI > 1
  - FX-SIMD (MM-support): OPIn = 2/4/8/16/32 (e.g. i: O4 O3 O2 O1)
  - FP-SIMD (3D-support): OPIn = 2/4 (e.g. i: O2 O1)

OPIn: number of operations per instruction
OPI: average number of operations per instruction

4.1. Overview (2)

Multiple operations within a single instruction: SIMD instructions (FX/FP), introduced as

• a superscalar extension (on top of superscalar issue)
• an EPIC extension

Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors
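What "multiple operations within a single instruction" means for FX-SIMD can be sketched by packing several narrow integers into one word and adding them lane by lane. The code assumes four 8-bit lanes with wrap-around arithmetic and emulates the lanes with a loop, whereas hardware performs them in parallel; it does not reproduce any particular instruction set, and all names are illustrative.

```python
from typing import List


def pack(lanes: List[int], lane_bits: int = 8) -> int:
    """Pack integers into one word, lane 0 in the least significant bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for k, value in enumerate(lanes):
        word |= (value & mask) << (k * lane_bits)  # masking wraps overflow
    return word


def unpack(word: int, n_lanes: int = 4, lane_bits: int = 8) -> List[int]:
    """Split a packed word back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> (k * lane_bits)) & mask for k in range(n_lanes)]


def simd_add(a: int, b: int, n_lanes: int = 4, lane_bits: int = 8) -> int:
    """One 'instruction' performing n_lanes independent wrap-around additions."""
    xs = unpack(a, n_lanes, lane_bits)
    ys = unpack(b, n_lanes, lane_bits)
    return pack([x + y for x, y in zip(xs, ys)], lane_bits)


a = pack([1, 2, 3, 250])
b = pack([10, 20, 30, 10])
result = unpack(simd_add(a, b))
print(result)  # the last lane wraps around modulo 256
```

Here a single packed addition stands for four operations, which is how a SIMD instruction raises the number of operations per instruction above one.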

4.2. The appearance of SIMD instructions in superscalars (1)

Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars (timeline 1990-2003; the figure marks the processors with multimedia support (FX-SIMD) and with support of 3D (FP-SIMD); the number in parentheses is the issue rate)

RISC processors
Compaq/DEC (Alpha): Alpha 21064, Alpha 21164, 21164PC, 21264, Alpha 21364
Motorola (MC 88000): MC88110
HP (PA): PA7100, PA-7100LC, PA-7200, PA8000, PA-8200, PA 8500, PA 8600, PA 8700
IBM Power: Power1 (4), Power2 (6/4), P2SC (6/4), Power3 (4), Power 4, Power 4+
PowerPC Alliance (PowerPC): PPC 601 (3), PPC 602 (2), PPC 603 (3), PPC 604 (4), PPC 620 (4), G3 (3), G4 (3)
MIPS (R): R 8000, R 10000, R 12000, R14000, R16000
Sun/Hal (SPARC): SuperSparc, UltraSparc, UltraSparc-2, UltraSparc-3, UltraSparc-3-Cu, Sparc64

CISC processors
Intel (80x86): Pentium, Pentium/MMX, PentiumPro, Pentium II, Pentium III, Pentium 4, P4 with HT
CYRIX/VIA (M): M1, MII
AMD/NexGen (Nx/K): Nx586, K5, K6, K6-2, K6-3, K7, Opteron

Intel's and AMD's ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional)

2.5. and 3. generation superscalars (1)

• Second generation superscalars + FX SIMD (MM) → 2.5. generation superscalars
• Second generation superscalars + FX SIMD + FP SIMD (MM+3D) → 3. generation superscalars

2.5. and 3. generation superscalars (2)

Figure 4.4: The emergence of 2.5. and 3. generation superscalars (same timeline 1990-2003 and the same RISC and CISC processor lines as in Figure 4.3; processors with multimedia support (FX-SIMD) form the 2.5. generation, processors that also support 3D (FP-SIMD) form the 3. generation)

Bottlenecks evoked by third generation superscalars:

• System architecture (memory, display): AGP bus, on-chip L2

4.3. Overview of superscalar processor generations

First generation ("Thin superscalars")
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core: unbuffered issue; no renaming; no ROB; no predecoding; static branch prediction
Caches: single ported data caches; blocking L1 data caches or nonblocking caches with up to a single pending cache miss allowed; off-chip L2 caches attached via the processor bus
ISA: no MM/3D support
Examples: Alpha 21064, PA 7100, PowerPC 601, SuperSparc, Pentium

Second generation ("Wide superscalars")
Width: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core: buffered issue (shelving); renaming; ROB; predecoding; dynamic branch prediction
Caches: dual ported data caches; nonblocking L1 data caches with multiple cache misses allowed; off-chip direct coupled L2 caches
Examples: Alpha 21264, PA 8000, PowerPC 604, PowerPC 620, UltraSparc I, II, Pentium Pro, K6, Power2 [3]

2.5 generation ("Wide superscalars with MM/3D support")
ISA: FX-SIMD instructions
Examples: Pentium II

Third generation ("Wide superscalars with MM/3D support")
Caches: on-chip L2 caches
ISA: FX- and FP-SIMD instructions
Examples: Power 4, Pentium III (0.18 μ), Pentium 4, Athlon (model 4), Athlon MP (model 6)

Footnotes attached to individual examples in the table:
[1] No renaming.
[2] Dual ported data cache, optional.
[3] No off-chip direct coupled L2.
[4] Only single ported data cache.

From generation to generation performance, complexity, memory bandwidth and branch prediction accuracy increase.

4.4. Exhausting the performance potential of data parallelism

In general purpose applications second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars already exhaust the parallelism available at the instruction level in dedicated applications (such as MM or 3D applications) as well.

Thus the era of ILP-processors came to an end.

4.5. The introduction of SIMD instructions in EPIC (VLIW) processors

VLIW architectures/processors did not support SIMD instructions

EPIC architectures/processors inherently support SIMD instructions (like the IA-64 ISA or processors of the Itanium family)

5. Summing up the main road of evolution

5.1. Main evolution scenarios

a. Evolutionary scenario (superscalar approach; the main road):
introduction and increase of temporal parallelism → introduction and increase of issue parallelism → introduction and increase of data parallelism

b. Radical scenario (VLIW/EPIC approach):
introduction of VLIW processing → introduction of data parallelism (EPIC)

5.2. Main road of processor evolution (1)

Figure 5.1: The three cycles of the main road of processor evolution

Sequential processing: traditional von N. procs.
ILP processing:
• ~ 1985/88: pipeline processors (temporal parallelism)
• ~ 1990/93: superscalar processors (+ issue parallelism)
• ~ 1994/00: superscalar processors with SIMD extension (+ data parallelism)

Along the time axis (t) both the extent of operation level parallelism and the level of hardware redundancy increase.

5.2. The main road of evolution (2)

Each cycle i (i = 1:3, introduction of temporal, issue and data parallelism) consists of the following steps:

• introduction of a particular dimension of parallelism
• processing bottleneck(s) arise
• elimination of the bottleneck(s) evoked by introducing appropriate techniques
• as a consequence, parallelism available at the given dimension becomes exhausted, and further performance increase is achievable only by introducing a new dimension of parallelism

Figure 5.2: Three main cycles of the main road

5.2. Main road of the evolution (3)

Figure 5.3: New techniques introduced in the three main cycles of processor evolution

Traditional sequential processing: traditional sequential processors

Introduction of temporal parallelism (~ 1985/88): pipeline processors
• 1. generation
• 1.5. generation: caches
• 2. generation: branch prediction

Introduction of issue parallelism (~ 1990/93): superscalar processors
• 1. generation: advanced memory subsystem, advanced branch processing
• 2. generation: dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual ported data caches, nonblocking L1 data caches with multiple cache misses allowed, off-chip direct coupled L2 caches

Introduction of data parallelism (~ 1994/97): superscalars with SIMD extension (ISA extension)
• 2.5. generation: FX SIMD extension, extension of system architecture (AGP, on-chip L2, ...)
• 3. generation: FP SIMD extension, ...

5.2. Main road of evolution (4)

Figure 5.4: Memory bandwidth and hardware complexity vs rising processor performance (time axis from ~ 1985 to ~ 2000; curves: performance, memory bandwidth, hardware complexity)

5.2. Main road of evolution (5)

Figure 5.5: Branch prediction accuracy vs rising clock rates (time axis from ~ 1985 to ~ 2000; curves: fc, number of pipeline stages, accuracy of branch prediction)

6. Outlook: introduction of thread level parallelism

6. Outlook: the introduction of thread level parallelism (1)

Granularity of parallelism:

• ILP (instruction-level parallelism): within a thread (instruction flow)
• TP (thread-level parallelism): multiple threads


Where can multiple threads come from?

• from different applications: multiprogramming
• from the same application: multitasking, multithreading


Basic implementation alternatives of thread level parallelism

• CMP: Chip Multiprocessing (SMP: Symmetric Multiprocessing): implementation by two or more cores placed on the same chip (chip with two cores and L2/L3, backed by L3/memory)
• SMT: Simultaneous Multithreading (HT: Hyperthreading (Intel)): implementation by a multithreaded core (chip with an SMT core and L2/L3, backed by L3/memory)

SMT: Simultaneous Multithreading (HT: Hyperthreading (Intel))

• (Four-way) superscalar: executes a single thread
• Multithreaded superscalar (four-way/two threads): executes thread 1 and thread 2
