Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions

Custom ISE Identification
[Figure: processor datapath with a register file, ALU, MUL and LD/ST units, data memory, and an AFU]
AFU: out1 = F (in1, in2, in3, in4), out2 = G (in1, in2, in3, in4)
Limited number of I/O ports

Outline
Problem formulation: ISE selection, I/O serialisation
Related work
Non-optimality of earlier work
Integer Linear Programming (ILP) formulation
Results
Conclusions

Problem Formulation
Given:
  a dataflow graph
  a set of forbidden nodes
Find a subgraph S which is:
  convex
  free of forbidden nodes
and has the largest gain:
  M (S) = Nexec * (SW (S) - HW (S))
[Figure: example dataflow graph with nodes labelled a to h and x1 to x3]
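
The gain formula can be illustrated with a small Python sketch. The helper name `merit`, the latencies and the execution count are illustrative assumptions, not values from the slides; SW (S) is taken as the sum of per-node software latencies, and HW (S) is supplied directly as a cycle count.

```python
def merit(n_exec, sw_cycles, hw_cycles, subgraph):
    """Return M(S) = Nexec * (SW(S) - HW(S)) for the node set `subgraph`.

    sw_cycles maps each node to its (assumed) software latency in cycles;
    hw_cycles is the (assumed) latency in cycles of the AFU implementing S.
    """
    sw_total = sum(sw_cycles[node] for node in subgraph)
    return n_exec * (sw_total - hw_cycles)


# Hypothetical numbers: three single-cycle operations fused into a 1-cycle AFU,
# inside a basic block that executes 1000 times.
sw_cycles = {"b": 1, "c": 1, "e": 1}
gain = merit(n_exec=1000, sw_cycles=sw_cycles, hw_cycles=1, subgraph={"b", "c", "e"})
print(gain)  # 1000 * (3 - 1) = 2000
```

A subgraph whose hardware latency equals its software latency has zero gain, so only subgraphs with SW (S) > HW (S) are worth selecting.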

Convex Subgraph
[Figure: non-convex AFU example on nodes a, b, c and d]
In order to execute the AFU we need the output of node b
Computation of node b requires the output of AFU
A non-convex AFU cannot be scheduled without creating a deadlock
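
A minimal Python sketch of a convexity test follows. It relies on the standard characterisation that a subgraph is convex iff no node outside it has both an ancestor and a descendant inside it. The function names and the edge list are assumptions made for illustration; the slide's figure is not recoverable from the transcript.

```python
def reachable(start, edges):
    """Return all nodes reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def is_convex(subgraph, edges):
    """True iff no outside node has both an ancestor and a descendant in S."""
    selected = set(subgraph)
    nodes = {n for edge in edges for n in edge}
    for node in nodes - selected:
        descendants = reachable(node, edges)
        ancestors = {m for m in nodes if node in reachable(m, edges)}
        if ancestors & selected and descendants & selected:
            return False
    return True


# Assumed edges for a four-node example: a -> b, a -> c, b -> d, c -> d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(is_convex({"a", "c", "d"}, edges))  # False: b lies between a and d
print(is_convex({"a", "b", "c", "d"}, edges))  # True
print(is_convex({"c", "d"}, edges))  # True
```

In the first call, node b needs the output of a (inside the AFU) and feeds d (also inside), which is exactly the deadlock situation described above.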

I/O Serialisation
[Figure: two drawings of an AFU with nodes b, c, d, e and f]
2 inputs, 4 outputs
Available I/O ports: (1, 2)

ISE Merit Estimation
M (S) = Nexec * (SW (S) - HW (S))
[Figure: the example dataflow graph (nodes a to h, x1 to x3) next to the AFU with nodes b, c, d, e and f]

Related Work
ISE identification under I/O constraints
  Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  ILP based approach [Atasu05]
  Pseudo-polynomial time algorithm [Bonzini07]
ISE identification under relaxed I/O constraints
  Restricted search space exploration [Pozzi05]
  Generation of a semi compact set of connected ISEs [Pothineni07]
I/O serialisation
  Exponential time algorithms [Pozzi05, Pothineni07]
Algorithms for specific processor models
  Single-issue RISC processor model [Verma07]

Earlier Work
ISE selection: Atasu03, Yu07, Chen07, Bonzini07
  Optimal ISE selection under various I/O constraints
I/O serialisation: Pozzi05, Pothineni07
  Exponential time I/O serialisation algorithm

Non-Optimality of Earlier Work
[Figure: two example graphs with node labels .2, .3, .5 and .6, each annotated with "cycle saved" values (extracted as 23.36, 15.02, 066 and 112)]

Our Contributions
Optimal ILP formulation for a large class of processor models
  Earlier work considered the RISC processor model only
Single run
  In the earlier work, ISE selection was done separately for various I/O constraints
ISE selection and I/O scheduling together
  Treating them separately is another source of non-optimality in earlier work

Integer Linear Programming
Objective function
Linear constraints
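
For reference, a generic integer linear program can be written as below. The symbols c, A, b and x are standard textbook notation rather than notation taken from the slides.

```latex
\begin{aligned}
\text{maximize}\quad & c^{\mathsf{T}} x \\
\text{subject to}\quad & A x \le b, \\
& x \in \mathbb{Z}^{n}
\end{aligned}
```

The objective function is the linear form c^T x, and each row of A x ≤ b is one linear constraint.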

ILP Formulation
Linear constraints
  No forbidden nodes
  Convexity constraints
  I/O serialisation based constraints
  I/O access per cycle based constraints
Objective function
  Saving in cycles should be maximum

ISE Selection Constraints (1 of 2)
Variable: For each node ni, a Boolean variable xi
  xi is true iff node ni is in the selected ISE
Constraint: No forbidden node should be in the ISE
  If ni is a forbidden node, then xi = 0
Variable: For each node ni, two Boolean variables pi and si
  pi (si) is true iff at least one predecessor (successor) of ni is in the selected ISE
Constraint: The subgraph corresponding to the selected ISE must be convex
  If pi and si are both true, then xi must be true (i.e., pi + si - xi ≤ 1)

ISE Selection Constraints (2 of 2)
Relationship between pi, si and xi
pi = 0 if ni has no parents; otherwise pi = OR of (xj OR pj) over all parents nj of ni
si = 0 if ni has no children; otherwise si = OR of (xj OR sj) over all children nj of ni
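
The following Python sketch evaluates x, p and s for a given candidate ISE and tests the inequality pi + si - xi ≤ 1 at every node, taking pi as "some predecessor is selected" and si as "some successor is selected". It only checks a fixed selection; in the ILP these values are variables chosen by the solver. The function names and the example graph are assumptions for illustration.

```python
def topological_order(nodes, edges):
    """Kahn's algorithm; `edges` are (parent, child) pairs of an acyclic graph."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return order


def satisfies_convexity_constraints(selected, nodes, edges):
    """Evaluate x, p, s for a candidate ISE and test p_i + s_i - x_i <= 1."""
    x = {n: int(n in selected) for n in nodes}
    order = topological_order(nodes, edges)
    p = {n: 0 for n in nodes}  # 1 iff some predecessor is selected
    s = {n: 0 for n in nodes}  # 1 iff some successor is selected
    for node in order:  # parents are finalised before their children
        for parent, child in edges:
            if child == node:
                p[node] |= x[parent] | p[parent]
    for node in reversed(order):  # children are finalised before their parents
        for parent, child in edges:
            if parent == node:
                s[node] |= x[child] | s[child]
    return all(p[n] + s[n] - x[n] <= 1 for n in nodes)


# Assumed example graph: a -> b, a -> c, b -> d, c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(satisfies_convexity_constraints({"a", "c", "d"}, nodes, edges))  # False
print(satisfies_convexity_constraints({"a", "b", "c", "d"}, nodes, edges))  # True
```

For the selection {a, c, d}, node b has pb = 1 and sb = 1 but xb = 0, so the left-hand side is 2 and the constraint is violated.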

I/O Serialisation Based Constraints (1 of 3)
[Figure: example AFU with nodes n1 to n5]
Variable: An integer variable intDelayi
  Denotes the cycle in which node ni is executed, e.g., intDelay1 = 0, intDelay4 = 1, intDelay5 = 2
Variable: A real variable fractionalDelayi
  Denotes the smallest time after the start of cycle intDelayi at which the output of ni is available, e.g., fractionalDelay3 = HW (n3), fractionalDelay4 = HW (n3) + HW (n4)
Variable: An integer variable ρij
  Denotes the number of stages across the edge between nodes ni and nj, e.g., ρ13 = 1, ρ34 = 0, ρ25 = 2

I/O Serialisation Based Constraints (2 of 3)
Constraint: The difference between the cycles of a predecessor and a successor node is the same as the number of latches on the edge connecting them, e.g.,
  intDelay4 = intDelay3 + ρ34
  intDelay5 = intDelay2 + ρ25
Constraint: The total number of stages is the same as the last cycle in which an output node is computed, e.g.,
  R = intDelay5 + ρ57
  R = intDelay2 + ρ26
[Figure: example AFU with nodes n1 to n7]
Extra latches on output edges are created in order to realize an imaginary sink node

I/O Serialisation Based Constraints (3 of 3)
Constraint: fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  Case 1: if the node is the first node in the cycle, fractionalDelay3 = HW (n3)
  Case 2: if the node is not the first node in the cycle, fractionalDelay4 = fractionalDelay3 + HW (n4)
Constraint: fractionalDelay of a node should never exceed the cycle time, e.g.,
  fractionalDelay3 ≤ λ
  fractionalDelay4 ≤ λ
[Figure: example AFU with nodes n1 to n7]
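
As a rough illustration of how intDelay, ρ and fractionalDelay relate, the Python sketch below takes a fixed assignment of cycles to nodes, derives ρ on every edge, accumulates fractionalDelay along same-cycle chains, and checks the result against the cycle time λ. It is a checker for one given schedule, not the ILP itself. The edge n4 -> n5, the hardware delays and the function name are assumptions; the other edges and the intDelay values follow the examples on the slides.

```python
def check_schedule(edges, hw_delay, int_delay, cycle_time):
    """Derive rho and fractionalDelay from intDelay and test the constraints.

    edges: (predecessor, successor) pairs of an acyclic graph.
    hw_delay: hardware delay HW(n) of each node, in the same unit as cycle_time.
    int_delay: cycle in which each node executes.
    Returns (rho, fractional, feasible).
    """
    # intDelay_j = intDelay_i + rho_ij, so rho is the cycle difference.
    rho = {(i, j): int_delay[j] - int_delay[i] for i, j in edges}
    fractional = {}

    def fractional_delay(node):
        if node not in fractional:
            # Predecessors in the same cycle (rho = 0) chain combinationally.
            same_cycle = [i for i, j in edges if j == node and rho[(i, j)] == 0]
            start = max((fractional_delay(i) for i in same_cycle), default=0.0)
            fractional[node] = start + hw_delay[node]
        return fractional[node]

    for node in hw_delay:
        fractional_delay(node)
    feasible = (all(r >= 0 for r in rho.values())
                and all(f <= cycle_time for f in fractional.values()))
    return rho, fractional, feasible


edges = [(1, 3), (3, 4), (2, 5), (4, 5)]  # edge 4 -> 5 is an assumption
hw_delay = {1: 0.3, 2: 0.3, 3: 0.5, 4: 0.4, 5: 0.6}  # assumed delays
int_delay = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
rho, fractional, feasible = check_schedule(edges, hw_delay, int_delay, cycle_time=1.0)
print(rho[(1, 3)], rho[(3, 4)], rho[(2, 5)])  # 1 0 2
print(feasible)  # True
```

Node n3 is the first node in its cycle, so its fractionalDelay is HW (n3); node n4 shares that cycle with n3 (ρ34 = 0), so its fractionalDelay is fractionalDelay3 + HW (n4), matching the two cases above.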

I/O Access Per Cycle Based Constraints
Variable: Boolean variables cikIN and cikOUT
  cikIN is true iff ni is an input of the ISE and is accessed in the kth stage of execution (similarly for cikOUT)
Constraint: In each stage no more than m inputs should be accessed, and no more than n outputs should be written back, i.e., for each k (sums taken over i):
  ∑ cikIN ≤ m
  ∑ cikOUT ≤ n
cikIN and cikOUT can be computed using the intDelay and fractionalDelay of nodes and the ρ values of incoming and outgoing edges of the AFU
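
The per-stage port limits can be sketched in Python as a simple count. The function name, the input and output names, and the (2, 1) port configuration are assumptions for illustration; each dictionary entry plays the role of a variable cikIN or cikOUT that is set to 1.

```python
from collections import Counter


def ports_respected(input_stage, output_stage, m, n):
    """Check that each stage reads at most m inputs and writes at most n outputs.

    input_stage maps each ISE input to the stage k in which it is read;
    output_stage maps each ISE output to the stage in which it is written back.
    """
    reads = Counter(input_stage.values())
    writes = Counter(output_stage.values())
    return (all(count <= m for count in reads.values())
            and all(count <= n for count in writes.values()))


# Hypothetical AFU with four inputs and two outputs, m = 2 reads and n = 1 write per stage.
inputs = {"in1": 0, "in2": 0, "in3": 1, "in4": 1}
outputs = {"out1": 1, "out2": 2}
print(ports_respected(inputs, outputs, m=2, n=1))  # True

# Reading three inputs in stage 0 exceeds m = 2.
crowded = {"in1": 0, "in2": 0, "in3": 0, "in4": 1}
print(ports_respected(crowded, outputs, m=2, n=1))  # False
```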

Objective Function
Saving in cycles should be maximized
  SW (S) - HW (S) should be maximum
  SW (S) = ∑ xi SW (ni)
  HW (S) = R
Any processor model where SW (S) and HW (S) can be computed using linear inequalities can be handled using ILP

Experimental Setup
[Figure: three experimental flows starting from the input dataflow graph]
ISE selection (Atasu03), no serialisation
ISE selection (Atasu03), then I/O serialisation (Pozzi05): exp / subopt
ILP method: exp / opt

Results (1 of 3)
[Figure: results for the viterbi, adpcmdecoder and adpcmcoder benchmarks, comparing no pipelining, Pozzi's algorithm and the ILP method]

Results (2 of 3)
Pozzi's algorithm takes several hours on this benchmark, and produces inferior results
Benchmark: aes
Biggest dataflow graph: 703
[Figure: results after 3 minutes and after an hour]

Results (3 of 3)
The best AFU with 22 inputs and 22 outputs

Conclusions
ISE selection: Atasu03, Yu07, Chen07, Bonzini07
I/O serialisation: Pozzi05, Pothineni07
The methodology can be generalized for a large class of processor models
Optimal, single run algorithm