Date posted: 21-Dec-2015
Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer
and the MESCAL Team
especially
Yujia Jin, Kaushik Ravindran, and N. R. Satish
04/18/23 2

Design Space Exploration Flow
[Figure: design space exploration flow. Labels: application description (a graph of FromDevice, IPVerify, DecIPTTL, LookupIPRoute, ToDevice, and Discard elements for devices 0 to 15); multiprocessor platform (MicroBlaze (soft), FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, on-chip BRAM; PE, Co-PE, MEM, and PERIPHERAL blocks); task graph + profiles (example tasks S1, R1, L1, T1, R2, L2, T2, S2); platform constraints; scheduling constraints; allocation/scheduling; HW/SW generation; implementation; performance analysis; performance numbers.]

Investigative Approach
Demonstrate network applications on FPGA-based soft multiprocessors
  Tomahawk exploration framework
  Automated task allocation and scheduling
Extend framework to large multiprocessor systems
  1000s to 10,000s of tasks
  100s to 1000s of PEs
  RAMP

What Is an FPGA-based Soft Multiprocessor System?
A network of architecture building blocks on an FPGA
Multiprocessor architecture customized for the target application
  Number of processors
  Interconnection network
  Memory hierarchy
  Custom co-processors
Cost reduction by avoiding custom silicon
Productivity gains due to software abstraction
Target devices: Xilinx Virtex-II Pro and Virtex-IV families of FPGAs
Architecture building blocks
  Processing element: MicroBlaze (soft), PowerPC (hard)
  Co-processor: hash engine, crypto engine
  Memory: BRAM (on-chip), SDRAM (off-chip)
  Bus and queue: FSL, OPB, PLB
[Figure: example multiprocessor configuration built from PE, Co-PE, MEM, and PERIPHERAL blocks, annotated with MicroBlaze (soft), FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, and on-chip BRAM.]

Obstacles to Their Adoption: Hard to Design
Complex micro-architecture design space
  Processor choices
  Memory hierarchy
  Communication topology
Difficult mapping decisions
  Assigning computation to processing elements
  Assigning data to exposed heterogeneous memories
To unlock the potential of these systems, tools that enable efficiency and productivity are needed

Example: Design Difficulty
Application task graph: tasks R1, L1, T1 and R2, L2, T2
Execution time (cycles)
  Task   P1   P2
  R      20   10
  L      20   20
  T      30   10
Architecture model: P1, P2, and a queue (profile: 10)
Explored designs
  Design A: makespan = 80 (total time = 80 on each processor)
  Design B: makespan = 70 (R1, L1, T1 on one processor and R2, L2, T2 on the other; total times = 70 and 40)
  Optimal design: makespan = 60 (total times = 50 and 60)
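
The per-processor totals for Design B can be checked against the execution-time table with a small sketch. It assumes that the tasks on one processor run back to back and that queue transfer time is ignored; the function names are hypothetical and not part of Tomahawk. Under this simplified model the totals for Design A and the optimal design cannot be reproduced, because their task placement did not survive in the transcript.

```python
# Execution time in cycles for each task type on each processor
# (values from the slide's table).
EXEC_CYCLES = {
    "R": {"P1": 20, "P2": 10},
    "L": {"P1": 20, "P2": 20},
    "T": {"P1": 30, "P2": 10},
}


def processor_loads(mapping):
    """Sum the execution cycles of the tasks assigned to each processor.

    `mapping` maps task names such as "R1" to a processor name.
    Assumption: tasks on one processor run back to back, and queue
    transfer time is ignored.
    """
    loads = {"P1": 0, "P2": 0}
    for task, proc in mapping.items():
        task_type = task[0]  # "R1" -> "R"
        loads[proc] += EXEC_CYCLES[task_type][proc]
    return loads


def bottleneck(mapping):
    """Return the largest per-processor load, a simple stand-in for makespan."""
    return max(processor_loads(mapping).values())


# Design B: R1, L1, T1 on P1 and R2, L2, T2 on P2.
design_b = {"R1": "P1", "L1": "P1", "T1": "P1",
            "R2": "P2", "L2": "P2", "T2": "P2"}

print(processor_loads(design_b))  # {'P1': 70, 'P2': 40}
print(bottleneck(design_b))       # 70
```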

Tomahawk: Network Applications onto Soft MPs
[Figure: Tomahawk flow. Labels: Click application (FromDevice, IPVerify, DecIPTTL, LookupIPRoute, ToDevice, and Discard elements for devices 0 to 15); task graph (tasks S1, R1, L1, T1, R2, L2, T2, S2); automated micro-architecture configuration; automated mapping (P1, P2, M1); C programs and micro-architecture specification; Xilinx 2VP50 FPGA (MicroBlaze (soft), FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, on-chip BRAM; PE, Co-PE, MEM, and PERIPHERAL blocks).]

Possible Approaches for Automated Exploration
Randomized algorithms
  Probabilistic bounds, simulated annealing
Heuristic methods
  List scheduling, force-directed scheduling
Exact methods
  Enumeration and tabu search, branch-and-bound
Limitations of these approaches
  Specific implementation constraints are hard to enforce
  Most approaches require per-instance tuning and are hard to generalize, and are therefore poor for design space exploration
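
To make the heuristic category concrete, here is a minimal sketch of list scheduling, one of the methods named above. It is a generic textbook-style version, not the scheduler used in this work: tasks are visited in a fixed priority order and each is placed on the processor where it finishes earliest. The task names, cycle counts, and the omission of communication cost are all assumptions made for illustration.

```python
def list_schedule(tasks, deps, exec_cycles, procs):
    """Greedy list scheduling.

    tasks: task names in priority order (must be a valid topological order).
    deps: dict mapping a task to the list of tasks it depends on.
    exec_cycles: dict mapping (task, proc) to execution cycles.
    procs: list of processor names.
    Returns (schedule, makespan); schedule maps task -> (proc, start, finish).
    """
    proc_free = {p: 0 for p in procs}  # time at which each processor is idle
    schedule = {}
    for task in tasks:
        # A task may start only after all of its predecessors have finished.
        ready = max((schedule[d][2] for d in deps.get(task, [])), default=0)
        best = None
        for p in procs:
            start = max(ready, proc_free[p])
            finish = start + exec_cycles[(task, p)]
            if best is None or finish < best[2]:
                best = (p, start, finish)
        schedule[task] = best
        proc_free[best[0]] = best[2]
    makespan = max(finish for _, _, finish in schedule.values())
    return schedule, makespan


# Hypothetical four-task graph: A and B feed C, and C feeds D.
tasks = ["A", "B", "C", "D"]
deps = {"C": ["A", "B"], "D": ["C"]}
exec_cycles = {
    ("A", "P1"): 2, ("A", "P2"): 3,
    ("B", "P1"): 4, ("B", "P2"): 2,
    ("C", "P1"): 3, ("C", "P2"): 3,
    ("D", "P1"): 1, ("D", "P2"): 2,
}
schedule, makespan = list_schedule(tasks, deps, exec_cycles, ["P1", "P2"])
print(makespan)  # 6
```

The greedy choice is fast, but adding a rule such as "tasks A and B must share a processor" means rewriting the placement loop, which is the kind of limitation the slide points to.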

Constraint Optimization Techniques for Automated Exploration
Constraint solver technologies
  Integer linear programming (ILP) solvers
  0-1 Boolean reasoning solvers (SAT, PB-SAT)
Advantages
  Constraint formulations are a formal, yet natural, way to capture a mathematical optimization problem
  Implementation constraints specific to a problem can be incorporated easily
  Constraint solvers can exhaustively cover a search space without enumerating all solutions
Key strategies to improve solver performance
  Decomposition methods
  Variable ordering
  Improved lower and upper bounds
  Symmetry representation
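
The idea of covering a search space without enumerating every solution can be illustrated with a hand-written branch-and-bound search. This is only a sketch of bound-based pruning, not how an ILP or SAT solver is implemented internally, and it solves a simplified problem: independent tasks, no communication cost, and the objective of minimizing the largest processor load. The task names and cycle counts are hypothetical.

```python
def min_bottleneck(tasks, procs, cycles):
    """Branch-and-bound search for the allocation with the smallest maximum load.

    tasks: list of task names; procs: list of processor names;
    cycles: dict mapping (task, proc) to execution cycles.
    Returns (best_load, best_allocation, nodes_visited).
    """
    best = {"load": float("inf"), "alloc": None, "nodes": 0}
    loads = {p: 0 for p in procs}
    alloc = {}

    def search(i):
        best["nodes"] += 1
        # Lower bound: the current maximum load can only grow as tasks are added.
        if max(loads.values()) >= best["load"]:
            return  # prune: this subtree cannot beat the best solution so far
        if i == len(tasks):
            best["load"] = max(loads.values())
            best["alloc"] = dict(alloc)
            return
        t = tasks[i]
        for p in procs:
            loads[p] += cycles[(t, p)]
            alloc[t] = p
            search(i + 1)
            loads[p] -= cycles[(t, p)]
            del alloc[t]

    search(0)
    return best["load"], best["alloc"], best["nodes"]


# Hypothetical instance: four independent tasks on two processors.
cycles = {
    ("a", "P1"): 4, ("a", "P2"): 6,
    ("b", "P1"): 3, ("b", "P2"): 3,
    ("c", "P1"): 5, ("c", "P2"): 2,
    ("d", "P1"): 2, ("d", "P2"): 4,
}
load, allocation, nodes = min_bottleneck(["a", "b", "c", "d"], ["P1", "P2"], cycles)
print(load)        # 6
print(allocation)  # {'a': 'P1', 'b': 'P2', 'c': 'P2', 'd': 'P1'}
```

The full search tree for this instance has 31 nodes; the bound lets the search skip part of it while still proving that 6 is optimal. Tighter lower bounds and a better variable order prune more, which is why both appear in the list of key strategies.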

ILP Formulation

Example Application: IPv4 Packet Forwarding
Data plane of IPv4 packet forwarding (RFC 1812)
  Campus network router, home router
  Medium-sized route table (5,000 entries or fewer)
  Route table small enough to fit in on-chip memory
Target platform
  Xilinx Virtex-II Pro 2VP50 FPGA
Architecture library
  MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queue
[Figure: forwarding data path from ingress to egress. Stages: receive IPv4 packet; verify version, checksum, and TTL; look up next hop (prefix match) in the route table; update checksum and TTL; transmit IPv4 packet. Packet labels: header, payload.]
Lookup: inspect destination address and find next hop
  Longest prefix match
  Implementation determined by route distribution, memory, and performance constraints
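
As a minimal sketch of what the lookup task computes, the following uses Python's standard `ipaddress` module to pick the most specific matching prefix. The route entries and port names are made up for illustration, and the linear scan is chosen only for brevity; as the slide notes, a real implementation is shaped by the route distribution and by memory and performance constraints.

```python
import ipaddress


def longest_prefix_match(route_table, dst):
    """Return the next hop for `dst`, or None if no prefix matches.

    route_table: list of (prefix, next_hop) pairs, e.g. ("10.1.0.0/16", "port2").
    A linear scan keeps the sketch short; it is not meant to be fast.
    """
    addr = ipaddress.ip_address(dst)
    best_len = -1
    best_hop = None
    for prefix, next_hop in route_table:
        net = ipaddress.ip_network(prefix)
        # Keep the matching prefix with the most leading bits fixed.
        if addr in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            best_hop = next_hop
    return best_hop


# Hypothetical route table.
routes = [
    ("0.0.0.0/0", "port0"),    # default route
    ("10.0.0.0/8", "port1"),
    ("10.1.0.0/16", "port2"),
]
print(longest_prefix_match(routes, "10.1.2.3"))   # port2
print(longest_prefix_match(routes, "10.9.9.9"))   # port1
print(longest_prefix_match(routes, "192.0.2.1"))  # port0
```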

Hand-tuned Multiprocessor Design for IPv4 Forwarding
Achieved 1.8 Gbps throughput for header processing
  using 12 MicroBlaze processors
[Figure: hand-tuned design. MicroBlaze stages labeled "Verify ver & ttl, checksum, Lookup1" and "Lookup2", with route tables in Block RAM; packets arrive from and return to source MicroBlaze 1 and source MicroBlaze 2. Key: MicroBlaze, Block RAM, bus, queue.]

Improved Design after Automated Exploration
Resulting design achieved 2.0 Gbps throughput
  surpassing the performance of the 1.8 Gbps hand-tuned design
  using one fewer MicroBlaze processor
The improvement was due to a less regular configuration and a balanced workload of tasks across the processors
[Figure: improved design. MicroBlaze stages labeled "Lookup1, Verify ver & ttl", "Lookup2", and "Lookup3, Verify checksum", with route tables in Block RAM; packets arrive from and return to source MicroBlaze 1 and source MicroBlaze 2. Key: MicroBlaze, Block RAM, bus, queue.]

Justifying Constraint Optimization Techniques
Our constraint optimization method can handle instances of the representative allocation and scheduling problem with up to 100s of tasks onto 10s of PEs
Implementation constraints can be easily incorporated
  Task groupings
  Multiprocessor topology restrictions
  Preferred allocations
  Memory assignments
  Mutual exclusion
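
As a sketch of why such constraints are easy to add, consider a 0-1 formulation with a binary variable $x_{t,p}$ that is 1 when task $t$ is allocated to PE $p$. The notation is assumed here for illustration and is not taken from the slides. Three of the listed constraint types then become single linear constraints:

```latex
\begin{align*}
&\sum_{p \in P} x_{t,p} = 1 && \forall t \in T && \text{(each task on exactly one PE)} \\
&x_{a,p} = x_{b,p} && \forall p \in P && \text{(task grouping: $a$ and $b$ share a PE)} \\
&x_{a,p} + x_{b,p} \le 1 && \forall p \in P && \text{(mutual exclusion: $a$ and $b$ on different PEs)} \\
&x_{t,q} = 1 && && \text{(preferred allocation, in its strict form: $t$ fixed to PE $q$)}
\end{align*}
```

Each rule is added without touching the rest of the model, in contrast to a hand-written heuristic where the same rule would change the search procedure itself.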

Following Moore’s Law
Extend to more complex applications
  1000s to 10,000s of tasks
Extend to bigger multiprocessor systems
  100s to 1000s of PEs
[Figure: explore step targeting an on-chip network of eight nodes, each with a PE, a memory (M), and a network interface (NI).]

What Can We Do for RAMP?
Challenges in deploying concurrent applications on a RAMP system
  Task allocation and scheduling across 100s to 1000s of PEs
  Fast mapping step to enable efficient design space exploration
Our optimization techniques for static task allocation and scheduling are a first step to address these challenges
  A “compile-time” tool to guide the designer to explore efficient mappings
  Flexible formulation to target diverse multiprocessors
  Research in progress to extend our techniques to work on problems at the scale of RAMP systems

Backup Slides

Example
Optimal design found in less than 6 seconds on a 400 MHz Sparc II
[Figure: an application and an architecture library for the 2VP50 (MicroBlazes, PowerPC, BRAMs; communication over FSLs and a bus) go through an explore step that produces the optimal design, shown with processors P1 and P2 and memory M1.]

Following Moore’s Law
Extend to more complex applications
  1000s to 10,000s of tasks
  DSLAM
Extend to bigger multiprocessor systems
  100s to 1000s of PEs
  RAMP

Challenges in Automated Exploration
Higher exploration complexity
  Increases by 2 orders of magnitude
More emphasis on communication
  Arbitration modeling
  Routing constraints due to network topology
Statistical cost model for dynamic behavior

Potential Approaches to Address These Challenges
Additional constraints can be easily added to incorporate new features
Constraint solver performance will slow down and thus become the bottleneck
Some strategies to improve constraint solver performance
  Task-graph-based structural decompositions
  Relaxation heuristics
  Symmetry representation
  Cutting planes and valid inequalities

[Figure: on-chip network of eight nodes, each with a PE, a memory (M), and a network interface (NI). Key: PE, processor interconnect, memory, network interface.]