High Performance Processor Architecture
André Seznec, IRISA/INRIA, ALF project-team
Moore's "Law"

The number of transistors on a microprocessor chip doubles every 18 months:
1972: 2,000 transistors (Intel 4004)
1979: 30,000 transistors (Intel 8086)
1989: 1 M transistors (Intel 80486)
1999: 130 M transistors (HP PA-8500)
2005: 1.7 billion transistors (Intel Itanium Montecito)

Processor performance doubles every 18 months:
1989: Intel 80486, 16 MHz (< 1 inst/cycle)
1993: Intel Pentium, 66 MHz x 2 inst/cycle
1995: Intel Pentium Pro, 150 MHz x 3 inst/cycle
06/2000: Intel Pentium III, 1 GHz x 3 inst/cycle
09/2002: Intel Pentium 4, 2.8 GHz x 3 inst/cycle
09/2005: Intel Pentium 4, dual core, 3.2 GHz x 3 inst/cycle x 2 processors
Not just the IC technology
VLSI brings the transistors and the frequency, ...
Microarchitecture and code-generation optimizations bring the effective performance.
The hardware/software interface
transistor
micro-architecture
Instruction Set Architecture (ISA)
Compiler/code generation
High-level language
(The ISA is the boundary: the layers above it are software, those below are hardware.)
Instruction Set Architecture (ISA)
Hardware/software interface: the compiler translates programs into instructions; the hardware executes the instructions.
Examples: Intel x86 (1979), still your PC's ISA; MIPS and SPARC (mid 80s); Alpha and PowerPC (90s).
ISAs evolve by successive add-ons: 16 bits to 32 bits, new multimedia instructions, etc.
Introducing a new ISA requires good reasons: new application domains, new constraints, no legacy code.
Microarchitecture
A macroscopic view of the hardware organization:
neither at the transistor level nor at the gate level,
but an understanding of the processor organization at the functional-unit level.
What is microarchitecture about ?
Memory access time is 100 ns. Program semantics are sequential.
But modern processors can execute 4 instructions every 0.25 ns.
How can we achieve that?
high performance processors everywhere
General-purpose processors (i.e. no special target application domain): servers, desktops, laptops, PDAs.
Embedded processors: set-top boxes, cell phones, automotive, ...; either special-purpose processors or derived from a general-purpose processor.
Performance needs
Performance means reducing the response time: scientific applications (treating larger problems), databases, signal processing, multimedia, ...
Historically, over the last 50 years, improving performance for today's applications has fostered new, even more demanding applications.
How to improve performance?
transistor
micro-architecture
Instruction set (ISA)
compiler
language
Use a better algorithm
Optimise code
Improve the ISA
More efficient microarchitecture
New technology
How transistors are used: evolution

In the 70s: enriching the ISA; increasing functionality to decrease instruction count.
In the 80s: caches and registers; decreasing external accesses; ISAs from 8 to 16 to 32 bits.
In the 90s: instruction parallelism; more instructions, lots of control, lots of speculation; more caches.
In the 2000s: more and more thread parallelism and core parallelism.
A few technological facts (2005)
Frequency: 1-3.8 GHz. An ALU operation: 1 cycle. A floating-point operation: 3 cycles. Read/write of a register file: 2-3 cycles, often a critical path.
Read/write of the L1 cache: 1-3 cycles, depending on many implementation choices.
A few technological parameters (2005)
Integration technology: 90 nm - 65 nm. 20-30 million transistors of logic; caches/predictors: up to one billion transistors. Power: 20-75 watts.
75 watts is a limit for cooling at reasonable hardware cost; 20 watts is a limit for reasonable laptop power consumption.
400-800 pins (939 pins on the dual-core Athlon).
The architect challenge
400 mm2 of silicon, 2/3 technology generations ahead: what will you use for performance?
Pipelining
Instruction-level parallelism
Speculative execution
Memory hierarchy
Thread parallelism
Up to now, what was microarchitecture about ?
Memory access time is 100 ns. Program semantics are sequential. An instruction's life (fetch, decode, ..., execute, ..., memory access, ...) is 10-20 ns.
How can we use the transistors to achieve the highest possible performance? So far: up to 4 instructions every 0.3 ns.
The architect tool box for uniprocessor performance
Pipelining
Instruction Level Parallelism
Speculative execution
Memory hierarchy
Pipelining
Just slice the instruction's life into equal stages and overlap the executions:

I0: IF DC EX M CT
I1:    IF DC EX M CT
I2:       IF DC EX M CT
(time →)
Principle
The execution of an instruction decomposes naturally into successive logical phases.
Instructions can be issued sequentially, without waiting for the completion of the previous one.
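This overlap can be sketched with a toy timing model (an illustrative sketch, not from the slides: instruction i enters stage s at cycle i + s):

```python
# Toy model of a 5-stage pipeline (IF, DC, EX, M, CT): instruction i
# enters stage s at cycle i + s, so k instructions finish in k + 4
# cycles instead of 5 * k cycles without pipelining.
STAGES = ["IF", "DC", "EX", "M", "CT"]

def timeline(n_instructions):
    return {f"I{i}": {stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_instructions)}

t = timeline(3)
print(t["I2"]["CT"])        # instruction I2 commits at cycle 6
print(len(STAGES) + 3 - 1)  # total cycles for 3 instructions: 7, not 15
```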
Some pipeline examples
MIPS R3000: a classic 5-stage pipeline.
MIPS R4000: an 8-stage, superpipelined design.
Very deep pipelines to achieve high frequency: Pentium 4: 20 stages minimum; Pentium 4 Extreme Edition: 31 stages minimum.
pipelining: the limits

Currently: 1 cycle = 12-15 gate delays, approximately the delay of a 64-bit addition.
Coming soon(?): 6-8 gate delays. On the Pentium 4, the ALU is sequenced at double frequency, about a 16-bit add delay.
Caution with long execution latencies

Integer: multiplication: 5-10 cycles; division: 20-50 cycles.
Floating point: addition: 2-5 cycles; multiplication: 2-6 cycles; division: 10-50 cycles.
Dealing with long instructions
Use a specific pipeline to execute floating-point operations, e.g. a 3-stage execution pipeline.
Stay longer in a single stage: integer multiply and divide.
sequential semantics issues on a pipeline

There exist situations where issuing an instruction every cycle would not allow correct execution:
Structural hazards: distinct instructions compete for a single hardware resource.
Data hazards: J follows I, and J accesses an operand that has not yet been produced by I.
Control hazards: I is a branch, but its target and direction are not known until a few cycles into the pipeline.
Enforcing the sequential semantic
Hardware management: first detect the hazard, then avoid its effect by delaying the instruction until the hazard is resolved.

I0: IF DC EX M CT
I1:    IF DC EX M CT
I2:       IF -- DC EX M CT   (delayed until the hazard is resolved)
(time →)
Read After Write

Example: a memory load has a delay before a dependent instruction can use the loaded value.
And how code reordering may help

a = b + c; d = e + f
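Each statement compiles to two loads, an add and a store. A sketch (assuming a 1-cycle load-use delay, one instruction issued per cycle, and hypothetical instruction names) of how interleaving the two independent computations hides the load delays:

```python
# Scheduling sketch: with a 1-cycle load-use delay, computing
# a = b + c; d = e + f naively stalls after each pair of loads;
# interleaving the two independent computations hides the delay.
LOAD_USE_DELAY = 1

def cycles(schedule, load_deps):
    """Issue 1 instruction/cycle; a consumer of a load must wait
    LOAD_USE_DELAY extra cycles after the load issues."""
    done = {}     # instruction -> issue cycle
    cycle = 0
    for inst in schedule:
        earliest = max((done[l] + 1 + LOAD_USE_DELAY
                        for l in load_deps.get(inst, [])), default=0)
        cycle = max(cycle + 1, earliest)
        done[inst] = cycle
    return cycle

deps = {"add1": ["ld_b", "ld_c"], "add2": ["ld_e", "ld_f"]}
naive  = ["ld_b", "ld_c", "add1", "st_a", "ld_e", "ld_f", "add2", "st_d"]
better = ["ld_b", "ld_c", "ld_e", "ld_f", "add1", "st_a", "add2", "st_d"]
print(cycles(naive, deps), cycles(better, deps))   # 10 vs 8 cycles
```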
Control hazard

The current instruction is a branch (conditional or not): which instruction is next?
The number of cycles lost on branches may be a major issue.
Control hazards
15-30% of instructions are branches. Targets and directions are known very late in the pipeline, not before: cycle 7 on the DEC 21264, cycle 11 on the Intel Pentium III, cycle 18 on the Intel Pentium 4.
X instructions are issued per cycle: we just cannot afford to lose these cycles!
Branch prediction / next instruction prediction
PC prediction → IF → DE → EX → MEM → WB, with correction on a misprediction.
Prediction of the address of the next instruction (block)
Dynamic branch prediction: just repeat the past
Keep a history of what happened in the past, and guess that the same behavior will occur next time: this essentially assumes that the behavior of the application tends to be repetitive.
Implementation: hardware storage tables read at the same time as the instruction cache.
What must be predicted: Is there a branch? What is its type? The target of a PC-relative branch. The direction of a conditional branch. The target of an indirect branch. The target of a procedure return.
Predicting the direction of a branch
It is more important to correctly predict the direction of a conditional branch than its target:
the PC-relative target is known/computed at decode time,
while the effective direction is only computed at execution time.
Prediction as the last time
for (i=0;i<1000;i++) {
for (j=0;j<N;j++) {
loop body
}
}
2 mispredictions per inner-loop execution: on the first and on the last iterations.

direction:   1110 1110 ...
prediction:  .111 0111 ...
Mispredict on the last iteration (the loop finally exits) and on the first iteration of the next execution (the predictor remembers the exit).
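The claim can be checked with a minimal sketch of such a 1-bit, predict-as-last-time scheme (illustrative; an inner trip count of N = 4 is assumed):

```python
# Simulate "predict as last time" (a 1-bit predictor) on the nested-loop
# branch above: the inner-loop branch is taken (1) N-1 times, then not
# taken (0) on exit, repeated for each of the outer iterations.
def last_time_mispredictions(outer=1000, n=4):
    last = 1                      # predictor state: last observed outcome
    mispredictions = 0
    for _ in range(outer):
        for outcome in [1] * (n - 1) + [0]:
            if last != outcome:
                mispredictions += 1
            last = outcome        # remember the outcome for next time
    return mispredictions

# ~2 per outer iteration: 1999 total (the very first run mispredicts once)
print(last_time_mispredictions())
```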
Exploiting more past: inter correlations
B1: if cond1 and cond2 …
B2: if cond1 …
cond1  cond2  cond1 AND cond2
T      T      T
T      N      N
N      T      N
N      N      N
Using information on B1 to predict B2
If cond1 AND cond2 is true (p = 1/4), predict cond1 true: 100% correct.
If cond1 AND cond2 is false (p = 3/4), predict cond1 false: 66% correct.
Exploiting the past: auto-correlation
for (i=0; i<100; i++)
for (j=0;j<4;j++)
loop body
111011101110
When the last 3 outcomes are taken, predict not taken; otherwise predict taken:
100% correct.
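A minimal sketch of this idea (a table indexed by the branch's recent local history; illustrative parameters, not the slides' exact mechanism):

```python
# Local-history predictor sketch for the inner-loop branch above:
# index a table of per-pattern predictions with the last 3 outcomes.
# After one warmup misprediction it tracks the periodic 1110 pattern
# perfectly.
def local_history_mispredictions(outer=100, n=4, hist_bits=3):
    table = {}            # history pattern -> predicted next outcome
    history = ()          # the last `hist_bits` outcomes
    mispredictions = 0
    outcomes = ([1] * (n - 1) + [0]) * outer
    for outcome in outcomes:
        prediction = table.get(history, 1)   # default: predict taken
        if prediction != outcome:
            mispredictions += 1
        table[history] = outcome             # learn: pattern -> outcome
        history = (history + (outcome,))[-hist_bits:]
    return mispredictions

print(local_history_mispredictions())   # 1: a single warmup misprediction
```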
General principle of branch prediction
Information on the branch (PC, global history, local history) is used to read prediction tables; a function F of the table entries gives the prediction.
Alpha EV8 predictor: derived from 2Bc-gskew (itself based on e-gskew). 352 Kbits; the EV8 was cancelled in 2001. Maximum history length > 21, ≈35.
Current state of the art (Dec 2006): the 256-Kbit TAGE predictor, based on geometric history lengths. A tagless base predictor is backed by several tagged tables; table i is indexed by a hash of the PC and the global history h[0:Li], and each entry holds a counter (ctr), a useful bit (u) and a tag. A tag match (=?) in the table using the longest history provides the prediction. Accuracy: 3.314 misp/KI.
ILP: Instruction level parallelism
Executing instructions in parallel: superscalar and VLIW processors

Until 1991, achieving 1 instruction per cycle through pipelining was the goal.
Pipelining reached its limits: multiplying stages does not lead to higher performance. Silicon area was available: parallelism is the natural way forward.
ILP: executing several instructions per cycle. There are different approaches, depending on who is in charge of the control:
• the compiler/software: VLIW (Very Long Instruction Word)
• the hardware: superscalar
Instruction Level Parallelism: what is ILP ?
A = B + C; D = E + F → 8 instructions

1. Ld @C, R1 (A)
2. Ld @B, R2 (A)
3. R3 ← R1 + R2 (B)
4. St @A, R3 (C)
5. Ld @E, R4 (A)
6. Ld @F, R5 (A)
7. R6 ← R4 + R5 (B)
8. St @D, R6 (C)

• (A), (B), (C): three groups of mutually independent instructions
• Each group can be executed in parallel
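As a sketch of why the groups matter, a toy dependence-based scheduler (with the dependence edges transcribed from the sequence above) shows the 8 instructions fitting in 3 issue cycles, given enough functional units:

```python
# Minimal list-scheduling sketch for the 8-instruction sequence above:
# instructions in the same group have no dependences between them, so
# with enough functional units each group can issue in one cycle.
deps = {                                 # instruction -> its producers
    1: [], 2: [], 3: [1, 2], 4: [3],     # A = B + C
    5: [], 6: [], 7: [5, 6], 8: [7],     # D = E + F
}

cycle = {}
for inst in sorted(deps):                # program order
    # an instruction can issue one cycle after all its producers
    cycle[inst] = max((cycle[d] + 1 for d in deps[inst]), default=0)

print(cycle)   # {1: 0, 2: 0, 3: 1, 4: 2, 5: 0, 6: 0, 7: 1, 8: 2}
```

All 8 instructions complete in 3 issue cycles (cycles 0, 1, 2) instead of 8 sequential ones.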
VLIW: Very Long Instruction Word
Each instruction explicitly controls the whole processor. The compiler/code scheduler is in charge of all functional units and manages all hazards:
– resource: decides whether two candidates compete for a resource
– data: ensures that data dependencies will be respected
– control: ?!?
VLIW architecture
A control unit, several functional units (FUs), a register bank and a memory interface.

Example of a long-instruction word:
igtr r6 r5 -> r127, uimm 2 -> r126, iadd r0 r6 -> r36, ld32d (16) r4 -> r33, nop;
VLIW (Very Long Instruction Word)
The control unit issues a single long-instruction word per cycle.
Each long instruction simultaneously launches several independent sub-instructions.
The compiler guarantees that:
the sub-instructions are independent;
the instruction is independent of all in-flight instructions.
There is no hardware to enforce dependencies.
VLIW architecture: often used for embedded applications
Binary compatibility is a nightmare: it necessitates keeping the same pipeline structure.
Very effective on regular codes with loops and very little control, but poor performance on general-purpose applications (too many branches).
Cost-effective hardware implementation: no control logic:
• less silicon area
• reduced power consumption
• reduced design and test delays
Superscalar processors
The hardware is in charge of the control: the semantics are sequential, and the hardware enforces them, including all dependencies.
Binary compatibility with previous-generation processors.
All general-purpose processors since 1993 are superscalar.
Superscalar: what are the problems ?
Is there instruction parallelism? On general-purpose applications: 2-8 instructions per cycle. On some applications: it could be in the 1000s.
How do we recognize parallelism? • By enforcing data dependencies.
Issuing in parallel: fetching instructions in parallel, decoding in parallel, reading operands in parallel, predicting branches very far ahead.
In-order execution
I0: IF DC EX M CT
I1: IF DC EX M CT
I2:    IF DC EX M CT
I3:    IF DC EX M CT
I4:       IF DC EX M CT
I5:       IF DC EX M CT
I6:          IF DC EX M CT
I7:          IF DC EX M CT
out-of-order execution
To optimize resource usage: execute as soon as the operands are valid.

I0: IF DC EX   M    CT
I1: IF DC wait wait EX M CT
I2:    IF DC EX M CT
I3:    IF DC EX M CT
Out of order execution
Instructions are executed out of order: if instruction A is blocked by the absence of its operands but instruction B has its operands available, then B can be executed!!
This generates a lot of hardware complexity!!
speculative execution on OOO processors
10-15% of instructions are branches. On the Pentium 4, direction and target are known at cycle 31!!
Predict and execute speculatively; validate at execution time. State-of-the-art predictors: ≈2-3 mispredictions per 1000 instructions.
Also predict: memory (in)dependence, and (to a limited extent) data values.
Out-of-order execution: just be able to "undo"

Branch mispredictions, memory-dependence mispredictions, interruptions, exceptions.
Validate (commit) instructions in order; do not do anything definitive out of order.
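A toy sketch of in-order commit (hypothetical instruction names and completion cycles; not a real reorder-buffer design): instructions may finish executing in any order, but they leave the buffer strictly in program order, so anything younger than a misprediction can still be squashed.

```python
# Toy reorder-buffer sketch: instructions *execute* out of order, but
# *commit* strictly in program order from the buffer head, so younger
# instructions can be undone on a misprediction or exception.
rob = [("I0", 3), ("I1", 8), ("I2", 4), ("I3", 5)]  # (name, finish cycle)

commit_order = []
cycle = 0
head = 0
while head < len(rob):
    cycle += 1
    # commit only from the head, and only once the head has finished:
    # I2 and I3 finish at cycles 4 and 5 but must wait for I1 (cycle 8)
    while head < len(rob) and rob[head][1] <= cycle:
        commit_order.append(rob[head][0])
        head += 1

print(commit_order)   # always program order: ['I0', 'I1', 'I2', 'I3']
```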
The memory hierarchy
Memory components
Most transistors in a computer system are memory transistors.
Main memory: usually DRAM; 1 GB is standard in PCs (2005); long access time: 150 ns = 500 cycles = 2000 instructions.
On-chip single-ported memory: caches, predictors, ...
On-chip multiported memory: register files, the L1 cache, ...
Memory hierarchy
Memory is either huge but slow, or small but fast: the smaller, the faster.
Memory-hierarchy goal: provide the illusion that the whole memory is fast.
Principle: exploit the temporal and spatial locality properties of most applications.
Locality property
On most applications, the following property applies:
Temporal locality: a data/instruction word that has just been accessed is likely to be re-accessed in the near future.
Spatial locality: the data/instruction words located close (in the address space) to a word that has just been accessed are likely to be accessed in the near future.
A few examples of locality
Temporal locality: loop indices, loop invariants, ...; instructions: loops, ...
• 90%/10% rule of thumb: a program spends 90% of its execution time on 10% of the static code (often much more on much less).
Spatial locality: arrays of data, data structures; instructions: the instruction following a non-branch instruction is always executed.
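A toy model (assumed 64-byte lines, counting only cold misses) illustrating why spatial locality matters: sequential accesses miss once per line, while widely strided accesses miss on every reference.

```python
# Illustrate spatial locality with a toy cache-line model: count how
# many distinct 64-byte lines a sequence of byte addresses touches.
LINE = 64            # bytes per cache line (assumed)

def misses(addresses):
    lines_seen = set()
    count = 0
    for addr in addresses:
        line = addr // LINE
        if line not in lines_seen:      # cold miss on a new line
            count += 1
            lines_seen.add(line)
    return count

seq = [4 * i for i in range(256)]         # 256 sequential 4-byte words
strided = [4096 * i for i in range(256)]  # one word per 4-KB stride

print(misses(seq), misses(strided))       # 16 vs 256 misses
```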
Cache memory
A cache is a small memory whose content is an image of a subset of the main memory.
A memory reference is (1) presented to the cache; (2) on a miss, the request is presented to the next level of the memory hierarchy (the L2 cache or the main memory).
Cache

A cache is organized in cache lines, each with a tag that identifies the memory block it holds. On a Load &A, if the block address of A sits in the tag array, then the block is present in the cache; otherwise it is brought in from memory.
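A sketch of the tag check for a direct-mapped cache (hypothetical geometry: 64-byte lines and 256 sets, i.e. a 16-KB cache): the address splits into offset, index and tag, and a tag comparison decides hit or miss.

```python
# Direct-mapped cache lookup sketch: split the address into offset,
# index and tag; the tag array decides hit or miss.
OFFSET_BITS, INDEX_BITS = 6, 8              # 64-byte lines, 256 sets
tags = [None] * (1 << INDEX_BITS)           # the tag array

def access(addr):
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    hit = tags[index] == tag                # compare with the stored tag
    if not hit:
        tags[index] = tag                   # fill the line on a miss
    return hit

print(access(0x1234))   # False: cold miss
print(access(0x1238))   # True: same 64-byte line, spatial locality pays
```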
Memory hierarchy behavior may dictate performance
Example: 4 instructions/cycle, 1 data memory access per cycle; 10-cycle penalty for accessing the 2nd-level cache; 300-cycle round trip to memory; 2% miss rate on instructions, 4% on data; 1 reference out of 4 misses in L2.
To execute 400 instructions: 1320 cycles!!
Block size
Long blocks: exploit spatial locality, but load useless words when spatial locality is poor.
Short blocks: misses on contiguous blocks.
Experimentally: 16-64 bytes for small L1 caches (8-32 KB); 64-128 bytes for large caches (256 KB-4 MB).
Cache hierarchy
A cache hierarchy has become the standard:
L1: small (<= 64 KB), short access time (1-3 cycles); separate instruction and data caches.
L2: longer access time (7-15 cycles), 512 KB-2 MB; unified.
Coming L3: 2-8 MB (20-30 cycles); unified, shared on a multiprocessor.
Cache misses do not stop a processor (completely)
On an L1 cache miss, the request is sent to the L2 cache, but sequencing and execution continue.
• On an L2 hit, the latency is just a few cycles.
• On an L2 miss, the latency is hundreds of cycles: execution stops after a while.
Out-of-order execution allows several L2 cache misses to be initiated at the same time (serviced in a pipelined mode): the latency is partially hidden.
Prefetching
To avoid misses, one can try to anticipate them and load the (future) missing blocks into the cache in advance. Many techniques:
• sequential prefetching: prefetch the sequentially following blocks
• stride prefetching: recognize a stride pattern and prefetch the blocks in that pattern
• hardware and software methods are available; many complex issues remain: latency, pollution, ...
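A minimal sketch of a stride prefetcher of the kind described above (hypothetical table layout: one entry per load PC):

```python
# Stride-prefetcher sketch: remember the last address and stride per
# load instruction (keyed by PC); once the same stride is seen twice,
# prefetch the next address in the pattern.
table = {}   # pc -> (last_addr, last_stride)

def observe(pc, addr):
    """Record an access; return an address to prefetch, or None."""
    last = table.get(pc)
    prefetch = None
    if last is not None:
        stride = addr - last[0]
        if stride != 0 and stride == last[1]:   # stride confirmed twice
            prefetch = addr + stride
        table[pc] = (addr, stride)
    else:
        table[pc] = (addr, None)
    return prefetch

# A load at (hypothetical) PC 0x40 sweeping an array with stride 8:
for a in (100, 108, 116, 124):
    print(observe(0x40, a))    # None, None, 124, 132
```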
Execution time of a short instruction sequence is a complex function !
Branch predictor (correct / mispredict), ITLB, I-cache (hit / miss), execution core, DTLB, D-cache (hit / miss), L2 cache (hit / miss): the outcome at each of these components affects the execution time.
Code generation issues
First, avoid data misses (300-cycle miss penalties):
• data layout
• loop reorganization
• loop blocking
Instruction generation: minimize the instruction count (e.g. common-subexpression elimination); schedule instructions to expose ILP; avoid hard-to-predict branches.
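Loop blocking can be sketched as follows (illustrative, with an assumed tile size: the blocked traversal computes the same result while keeping each tile's working set small enough to stay cache-resident):

```python
# Loop blocking (tiling) sketch: traverse a 2-D array in fixed-size
# tiles so each block of data is reused while it is still in the cache.
N, B = 8, 4          # matrix size and (hypothetical) tile size

a = [[i * N + j for j in range(N)] for i in range(N)]

total = 0
for ii in range(0, N, B):            # tile row
    for jj in range(0, N, B):        # tile column
        for i in range(ii, ii + B):      # traversal inside the tile:
            for j in range(jj, jj + B):  # accesses stay within a small,
                total += a[i][j]         # cache-resident working set

print(total)   # same sum as the unblocked loops: sum(range(64)) = 2016
```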
On-chip thread-level parallelism
One billion transistors now !!
An ultimate 16-32-way superscalar uniprocessor seems unreachable: there is just not enough ILP, and a few key (power-hungry) components (register file, bypass network, issue logic) have more than quadratic complexity. Avoiding temperature hot spots would require very long intra-CPU communications.
On-chip thread parallelism appears to be the only viable solution: shared-memory multiprocessing, i.e. the chip multiprocessor; simultaneous multithreading; heterogeneous multiprocessing; vector processing.
The Chip Multiprocessor
Put a shared-memory multiprocessor on a single die: duplicate the processor and its L1 cache (maybe its L2); keep the caches coherent; share the last level of the memory hierarchy (maybe); share the external interface (to memory and the system).
General-purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003

Until 2003 there was a better (economic) use for transistors, since single-process performance was the most important:
• More complex superscalar implementations
• More cache space:
  • Bring the L2 cache on-chip
  • Enlarge the L2 cache
  • Include an L3 cache (now)

Diminishing returns !!

Now: CMP is the only option !!
71
Simultaneous Multithreading (SMT): parallel processing on a single processor
Functional units are underutilized on superscalar processors.

SMT: share the functional units of a superscalar processor between several processes.

Advantages:
• A single process can use all the resources
• Dynamic sharing of all structures on parallel/multiprocess workloads
72
[Figure: issue-slot occupancy over time (processor cycles). On a superscalar processor many issue slots remain unutilized; with SMT, instructions from Threads 1-5 fill the empty slots.]
73
The programmer's view of a CMP/SMT !
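From the programmer's point of view, a CMP or SMT chip simply looks like a shared-memory multiprocessor: several threads sharing one address space. A minimal POSIX-threads sketch (thread count, array sizes, and function names are illustrative assumptions):

```c
#include <pthread.h>
#include <assert.h>

#define NTHREADS 4                     /* illustrative thread count */
#define LEN 1000                       /* elements per thread */

static double vec[NTHREADS * LEN];     /* shared address space */
static double partial[NTHREADS];       /* one result slot per thread */

/* Each thread sums its own slice of the shared array. */
static void *sum_slice(void *arg) {
    long id = (long)arg;               /* which slice this thread owns */
    double s = 0.0;
    for (int i = 0; i < LEN; i++)
        s += vec[id * LEN + i];
    partial[id] = s;                   /* private slot: no data race */
    return NULL;
}

static double parallel_sum(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, sum_slice, (void *)id);
    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);     /* wait, then combine partials */
        total += partial[id];
    }
    return total;
}
```

Whether the four threads land on four CMP cores or on four SMT contexts of one core is invisible here; only the performance differs.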
74
Why is CMP/SMT the new frontier ?

(Most) applications were sequential, but hardware WILL be parallel: tens or hundreds of SMT cores in your PC or PDA 10 years from now (might be!).

• Option 1: applications will have to be adapted to parallelism
• Option 2: parallel hardware will have to run sequential applications efficiently
• Option 3: invent new tradeoffs

There is no current standard.
75
Embedded processing and on-chip parallelism (1)
ILP has been exploited for many years:
• DSPs: a multiply-add, 1 or 2 loads, and loop control in a single (long) cycle

Caches have been implemented on embedded processors for more than 10 years.

VLIW was introduced on embedded processors in 1997-98.

In-order superscalar processors were developed for the embedded market 15 years ago (Intel i960).
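The single-cycle DSP pattern mentioned above (a multiply-add, two loads, and loop control) is exactly the classic multiply-accumulate loop at the heart of FIR filters and dot products. A plain-C sketch (function name and data are illustrative):

```c
#include <assert.h>

/* Dot product as a MAC loop: each iteration does two loads, one
 * multiply-add, and the loop-control decrement -- the operations a DSP
 * folds into a single (long) cycle. */
static long mac_dot(const int *x, const int *h, int n) {
    long acc = 0;                      /* wide accumulator, DSP-style */
    for (int i = 0; i < n; i++)
        acc += (long)x[i] * h[i];      /* MAC: load x[i], load h[i], fma */
    return acc;
}
```

A DSP executes one iteration of this loop per cycle with zero-overhead loop hardware; a general-purpose processor needs several instructions per iteration.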
76
Embedded processing and on-chip parallelism (2): thread parallelism
Heterogeneous multicores are the trend:
• IBM Cell processor (2005):
  • One PowerPC (master)
  • 8 special-purpose processors (slaves)
• Philips Nexperia: a MIPS RISC microprocessor + a Trimedia VLIW processor
• ST Nomadik: an ARM + x VLIWs
77
What about a vector microprocessor for scientific computing?
A niche segment

Caches are not vector-friendly

Not so small: $2B !

Never heard of: L2 caches, prefetching, blocking ?

GPGPU !! GPU boards have become vector processing units

Vector parallelism is well understood !
78
Structure of future multicores ?
[Diagram: a 4×3 grid of microprocessors (μP), each with a private cache ($), all sharing an L3 cache.]
79
Hierarchical organization ?
[Diagram: four clusters of two cores (each core a μP with a private $); each cluster shares an L2 $, and the four L2 caches share an L3 $.]
80
An example of sharing
[Diagram: two dual-core clusters; within each cluster the two μPs share a floating-point unit (FP), the instruction-fetch front-end, an IL1 $ and a DL1 $. A hardware accelerator and an L2 cache are shared by the whole chip.]
81
A possible basic brick
[Diagram: a basic brick of four dual-core clusters (each cluster: two μPs sharing an FP unit, an I $ and a D $) arranged around a shared L2 cache.]
82
[Diagram: a full chip assembled from four such bricks (each brick: four dual-core clusters sharing an L2 cache), all connected to a shared L3 cache, a memory interface, a network interface, and a system interface.]
83
Only limited available thread parallelism ?

Focus on uniprocessor architecture:
• Find the correct tradeoff between complexity and performance
• Power and temperature issues

Vector extensions ?
• Contiguous vectors (à la SSE) ?
• Strided vectors in L2 caches (Tarantula-like) ?
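Contiguous "à la SSE" vectors can be sketched with x86 SSE intrinsics: one instruction operates on four packed floats at once. This is an illustrative example, assuming x86 hardware and a length that is a multiple of 4.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <assert.h>

/* c[i] = a[i] + b[i], four elements per instruction.
 * Assumes n is a multiple of 4 (remainder handling omitted for brevity). */
static void vec_add_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* load 4 contiguous floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* 4 adds in one instruction */
    }
}
```

Strided (Tarantula-like) vectors would instead gather elements separated by a constant stride directly in the L2 cache, which contiguous SSE-style loads cannot express.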
84
Another possible basic brick
[Diagram: another basic brick: an L2 cache shared between two dual-core clusters (each cluster: two μPs sharing an FP unit, an I $ and a D $) and one "ultimate" out-of-order superscalar core.]
85
[Diagram: a chip built from four such bricks (each brick: an out-of-order superscalar core plus dual-core clusters sharing an L2 $), connected to a shared L3 cache, a memory interface, a network interface, and a system interface.]
86
Some undeveloped issues: the power consumption issue

Need to limit power consumption:
• Laptop: battery life
• Desktop: above 75 W, heat is hard to extract at low cost
• Embedded: battery life, environment

Revisit the old performance concept as:
• "maximum performance in a fixed power budget"

Power-aware architecture design; need to limit frequency.
87
Some undeveloped issues: the performance predictability issue

Modern microprocessors have unpredictable/unstable performance.

The average user wants stable performance:
• Cannot tolerate performance variations of orders of magnitude when varying simple parameters

Real-time systems:
• Want to guarantee response time
88
Some undeveloped issues: the temperature issue

Temperature is not uniform on the chip (hotspots).

Raising the temperature above a threshold has devastating effects:
• Faulty behavior: transient or permanent
• Component aging

Solutions: gating (stopping the unit !!), clock scaling, or task migration.