Mapping Task Graphs to Processors in Large Multiprocessor Systems

Kurt Keutzer

and the MESCAL Team

especially

Yujia Jin, Kaushik Ravindran, and N. R. Satish

04/18/23 2

Design Space Exploration Flow

[Figure: design space exploration flow. Inputs: an application description (a task graph of FromDevice, IPVerify, LookupIPRoute, DecIPTTL, ToDevice, and Discard elements for ports 0 to 15) and a multiprocessor platform (MicroBlaze soft processors, hardware acceleration, FSL/OPB/PLB interconnect, Ethernet, off-chip SDRAM, on-chip BRAM). Stages shown: performance analysis (task graph + profiles), allocation/scheduling under platform and scheduling constraints, HW/SW generation, implementation, and performance numbers.]

Investigative Approach

Demonstrate network applications on FPGA-based soft multiprocessors
    Tomahawk exploration framework
    Automated task allocation and scheduling

Extend framework to large multiprocessor systems
    1000’s-10,000’s of tasks
    100’s-1000’s of PEs
    RAMP

What Is an FPGA-based Soft Multiprocessor System?

A network of architecture building blocks on an FPGA

Multiprocessor architecture customized for target application
    Number of processors
    Interconnection network
    Memory hierarchy
    Custom co-processors

Cost reduction by avoiding custom silicon

Productivity gains due to software abstraction

Architecture Building Blocks (Xilinx Virtex-II Pro, Virtex-IV family of FPGAs)

    Processing element: MicroBlaze (soft), PowerPC (hard)
    Co-processor: hash engine, crypto engine
    Memory: BRAM (on-chip), SDRAM (off-chip)
    Bus and queue: FSL, OPB, PLB

[Figure: example multiprocessor configuration of PEs, co-PEs, memories, and a peripheral, with PowerPC (hard) and MicroBlaze (soft) processors, FSL/OPB/PLB interconnect, Ethernet, off-chip SDRAM, and on-chip BRAM.]

Obstacles to Their Adoption: Hard to design

Complex micro-architecture design space
    Processor choices
    Memory hierarchy
    Communication topology

Difficult mapping decisions
    Assigning computation to processing elements
    Assigning data to exposed heterogeneous memories

To unlock the potential of these systems, tools enabling efficiency and productivity are needed

Example: Design Difficulty

Application task graph: two chains of tasks, R1 L1 T1 and R2 L2 T2

Execution time (cycles)

    Task    P1    P2
    R       20    10
    L       20    20
    T       30    10

Architecture model: processors P1 and P2 connected by a queue (the figure also shows the value 10, apparently the queue delay in cycles), annotated with the profile

Designs explored
    Design A: makespan = 80 (total time = 80 on each processor)
    Design B: P1 runs R1 L1 T1 (total time = 70), P2 runs R2 L2 T2 (total time = 40), makespan = 70
    Optimal design: makespan = 60 (total times = 50 and 60)
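The design difficulty in this example can be reproduced with a small exhaustive search. The sketch below is illustrative only and is not the Tomahawk tool: it assumes the chains are R1 → L1 → T1 and R2 → L2 → T2, that the execution times are R = 20/10, L = 20/20, T = 30/10 cycles on P1/P2, and that the "10" in the architecture model is a 10-cycle queue delay paid whenever a task and its predecessor sit on different processors. The names `EXEC`, `PRED`, `QUEUE_DELAY`, `makespan`, `best_for`, and `explore` are hypothetical.

```python
from itertools import permutations, product

# Execution time in cycles per task type and processor (as read from the slide).
EXEC = {"R": {"P1": 20, "P2": 10},
        "L": {"P1": 20, "P2": 20},
        "T": {"P1": 30, "P2": 10}}
TASKS = ["R1", "L1", "T1", "R2", "L2", "T2"]
# Assumed precedence: each chain is R -> L -> T.
PRED = {"R1": None, "L1": "R1", "T1": "L1",
        "R2": None, "L2": "R2", "T2": "L2"}
QUEUE_DELAY = 10  # assumption: cost of crossing between P1 and P2


def makespan(alloc, order):
    """Start each task as early as possible, visiting tasks in `order`.

    Returns None when `order` lists a task before its predecessor.
    """
    free = {"P1": 0, "P2": 0}   # time at which each processor becomes idle
    finish = {}
    for task in order:
        pred = PRED[task]
        ready = 0
        if pred is not None:
            if pred not in finish:
                return None
            crossing = alloc[pred] != alloc[task]
            ready = finish[pred] + (QUEUE_DELAY if crossing else 0)
        proc = alloc[task]
        start = max(free[proc], ready)
        finish[task] = start + EXEC[task[0]][proc]
        free[proc] = finish[task]
    return max(finish.values())


def best_for(alloc):
    """Best makespan over all task orders for one fixed allocation."""
    spans = (makespan(alloc, order) for order in permutations(TASKS))
    return min(s for s in spans if s is not None)


def explore():
    """Try all 2^6 allocations and return (best makespan, allocation)."""
    best = None
    for procs in product(["P1", "P2"], repeat=len(TASKS)):
        alloc = dict(zip(TASKS, procs))
        span = best_for(alloc)
        if best is None or span < best[0]:
            best = (span, alloc)
    return best


if __name__ == "__main__":
    print(explore())
```

Under these assumptions the search reports a best makespan of 60 cycles, and the one-chain-per-processor allocation of Design B evaluates to 70, both consistent with the slide. Placing both L tasks on P1 and the remaining tasks on P2 evaluates to 80, which matches Design A's makespan, although the slide does not preserve Design A's exact allocation. Exhaustive search is only feasible because the instance has six tasks and two processors.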

Tomahawk: Network Applications onto Soft MPs

[Figure: Tomahawk flow. A Click description of the application (FromDevice, IPVerify, LookupIPRoute, DecIPTTL, ToDevice, and Discard elements for ports 0 to 15) is turned into a task graph; C programs and a micro-architecture specification target a Xilinx 2VP50 FPGA. Steps shown: automated micro-architecture configuration and automated mapping of the example task graph (S1, R1 L1 T1, R2 L2 T2, S2) onto processors P1 and P2 and memory M1.]

Possible Approaches for Automated Exploration

Randomized algorithms
    probabilistic bounds, simulated annealing

Heuristic methods
    list scheduling, force-directed scheduling

Exact methods
    enumeration and tabu search, branch-and-bound

Limitations of these approaches
    Specific implementation constraints are hard to enforce
    Most approaches require per-instance tuning and are hard to generalize, and are therefore poor for design space exploration
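To make the heuristic category concrete, here is a minimal sketch of list scheduling with an earliest-finish-time rule, applied to the two-chain example from the design difficulty slide. It is an illustration, not the authors' implementation. The task times, the 10-cycle cross-processor delay, and the names `EXEC`, `PREDS`, and `list_schedule` are assumptions made for this example.

```python
# Execution time in cycles per task type and processor (from the earlier example).
EXEC = {"R": {"P1": 20, "P2": 10},
        "L": {"P1": 20, "P2": 20},
        "T": {"P1": 30, "P2": 10}}
# Assumed precedence: each chain is R -> L -> T.
PREDS = {"L1": ["R1"], "T1": ["L1"], "L2": ["R2"], "T2": ["L2"]}


def list_schedule(order, procs=("P1", "P2"), delay=10):
    """Greedy list scheduling.

    Tasks are visited in the given priority order (which must respect
    precedence). Each task goes to the processor where it would finish
    earliest; ties are broken by processor name.
    Returns {task: (processor, start, finish)}.
    """
    free = {p: 0 for p in procs}
    placed = {}
    for task in order:
        options = []
        for p in procs:
            # A predecessor on another processor adds the queue delay.
            ready = max(
                (placed[q][2] + (delay if placed[q][0] != p else 0)
                 for q in PREDS.get(task, [])),
                default=0,
            )
            start = max(free[p], ready)
            options.append((start + EXEC[task[0]][p], p, start))
        finish, p, start = min(options)
        placed[task] = (p, start, finish)
        free[p] = finish
    return placed


if __name__ == "__main__":
    schedule = list_schedule(["R1", "R2", "L1", "L2", "T1", "T2"])
    for task, (proc, start, finish) in schedule.items():
        print(task, proc, start, finish)
    print("makespan", max(f for _, _, f in schedule.values()))
```

On this small instance the greedy rule happens to reach a makespan of 60 cycles, but a list scheduler gives no such guarantee in general, and adding a rule such as "these two tasks must share a processor" means rewriting the selection logic, which is the limitation the slide points out.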

Constraint Optimization Techniques for Automated Exploration

Constraint solver technologies
    Integer linear programming (ILP) solvers
    0-1 Boolean reasoning solvers (SAT, PB-SAT)

Advantages
    Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
    Implementation constraints specific to a problem can be incorporated easily
    Constraint solvers can exhaustively cover a search space without enumerating all solutions

Key strategies to improve solver performance:
    Decomposition methods
    Variable ordering
    Improved lower and upper bounds
    Symmetry representation

ILP Formulation
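The body of this slide is not present in the transcript. As a rough illustration of what an ILP for static allocation and scheduling can look like, here is a textbook-style sketch; it is not the authors' formulation, and every symbol is an assumption: tasks t, u in T, precedence edges E, processors p in P, execution time w_{t,p}, cross-processor delay d, a sufficiently large constant B, binary allocation variables x_{t,p}, binary ordering variables y_{t,u}, binary edge-crossing variables z_{t,u}, start times s_t, finish times f_t, and makespan M.

```latex
\begin{align}
\min\ & M \\
\text{s.t.}\quad
 & \sum_{p \in P} x_{t,p} = 1
   && \forall t \in T \\
 & f_t = s_t + \sum_{p \in P} w_{t,p}\, x_{t,p}
   && \forall t \in T \\
 & z_{t,u} \ge x_{t,p} - x_{u,p}
   && \forall (t,u) \in E,\ p \in P \\
 & s_u \ge f_t + d\, z_{t,u}
   && \forall (t,u) \in E \\
 & s_u \ge f_t - B\,(3 - x_{t,p} - x_{u,p} - y_{t,u})
   && \forall t \ne u,\ p \in P \\
 & s_t \ge f_u - B\,(2 - x_{t,p} - x_{u,p} + y_{t,u})
   && \forall t \ne u,\ p \in P \\
 & M \ge f_t
   && \forall t \in T \\
 & x_{t,p},\, y_{t,u},\, z_{t,u} \in \{0,1\}, \quad s_t \ge 0
\end{align}
```

The second constraint assigns each task to exactly one processor. The two big-B constraints only become active when t and u share processor p, in which case y_{t,u} = 1 forces t to finish before u starts and y_{t,u} = 0 forces the opposite order.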

Example Application: IPv4 Packet Forwarding

Data plane of IPv4 packet forwarding (RFC 1812)
    Campus network router, home router
    Medium-sized route table (5,000 entries or fewer)
    Route table small enough to fit in on-chip memory

Target platform
    Xilinx Virtex-II Pro 2VP50 FPGA

Architecture library
    MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queue

[Figure: forwarding data path from ingress to egress. Receive IPv4 packet; verify version, checksum, and TTL; look up next hop (prefix match) in the route table; update checksum and TTL; transmit IPv4 packet. The packet is shown split into header and payload.]

Lookup: inspect destination address and find next hop
    Longest prefix match
    Implementation determined by route distribution, memory, and performance constraints
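The header-processing steps named on this slide (verify, lookup, update) can be sketched in a few lines. This is a software illustration only and says nothing about how the tasks were implemented on the MicroBlazes. It assumes a 20-byte header without options, drops packets whose TTL would reach zero, and uses a linear scan for longest prefix match rather than a memory-efficient lookup structure. The function names and the route table contents are hypothetical.

```python
import ipaddress
import struct


def checksum16(data: bytes) -> int:
    """Ones'-complement checksum over 16-bit words (IPv4 header checksum)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify(header: bytes) -> bool:
    """Check version, TTL, and checksum of a 20-byte header."""
    version = header[0] >> 4
    ttl = header[8]
    # A valid header, summed with its own checksum field, yields 0.
    return version == 4 and ttl > 1 and checksum16(header) == 0


def lookup(route_table, dst: str):
    """Longest prefix match by linear scan (illustrative, not efficient)."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in route_table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else best[1]


def update(header: bytes) -> bytes:
    """Decrement TTL and recompute the checksum from scratch."""
    h = bytearray(header)
    h[8] -= 1
    h[10:12] = b"\x00\x00"
    h[10:12] = struct.pack("!H", checksum16(bytes(h)))
    return bytes(h)


def make_header(src: str, dst: str, ttl: int) -> bytes:
    """Build a minimal header with a valid checksum (helper for the example)."""
    h = bytearray(struct.pack(
        "!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, ttl, 17, 0,
        ipaddress.ip_address(src).packed, ipaddress.ip_address(dst).packed))
    h[10:12] = struct.pack("!H", checksum16(bytes(h)))
    return bytes(h)


def forward(header: bytes, route_table):
    """Return (next hop, updated header), or None if the packet is dropped."""
    if not verify(header):
        return None
    dst = str(ipaddress.ip_address(header[16:20]))
    hop = lookup(route_table, dst)
    return None if hop is None else (hop, update(header))


if __name__ == "__main__":
    table = [("10.0.0.0/8", "port1"), ("10.1.0.0/16", "port2")]
    print(forward(make_header("192.168.0.1", "10.1.2.3", 64), table)[0])
```

In the example, the destination 10.1.2.3 matches both prefixes and the longer one, 10.1.0.0/16, wins. Splitting `verify`, `lookup`, and `update` into separate functions mirrors the way the slide treats them as separate tasks that can be mapped to different processors.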

Hand-tuned Multiprocessor Design for IPv4 Forwarding

Achieved 1.8 Gbps throughput for header processing
    using 12 MicroBlaze processors

[Figure: hand-tuned design. Labels: four blocks marked "Verify ver & ttl, checksum, Lookup1", four blocks marked "Lookup2", two route tables, and queues from and to source MicroBlazes 1 and 2. Key: MicroBlaze, Block RAM, bus, queue.]

Improved Design after Automated Exploration

Resulting design achieved 2.0 Gbps throughput
    surpassing the performance of the 1.8 Gbps hand-tuned design
    using one fewer MicroBlaze processor

The improvement was due to a less regular configuration and a balanced workload of tasks across the processors
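A quick calculation puts the two results side by side. It uses only the figures quoted on the slides (1.8 Gbps on 12 MicroBlazes, 2.0 Gbps on one fewer); the variable names are hypothetical.

```python
# Figures quoted on the slides.
hand_gbps, hand_procs = 1.8, 12
auto_gbps, auto_procs = 2.0, 12 - 1

# Throughput per processor for each design.
hand_per_proc = hand_gbps / hand_procs
auto_per_proc = auto_gbps / auto_procs

# Relative gains of the automated design over the hand-tuned one.
throughput_gain = auto_gbps / hand_gbps - 1
per_proc_gain = auto_per_proc / hand_per_proc - 1

print(f"hand-tuned: {hand_per_proc:.3f} Gbps per processor")
print(f"automated:  {auto_per_proc:.3f} Gbps per processor")
print(f"throughput gain: {throughput_gain:.1%}")
print(f"per-processor gain: {per_proc_gain:.1%}")
```

The total throughput gain is about 11%, and because the automated design also uses fewer processors, the gain per processor is about 21%.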

[Figure: design found by automated exploration. Labels: four "Lookup1" and four "Verify ver & ttl", four "Lookup2", three "Lookup3" and three "Verify checksum", three route tables, and queues from and to source MicroBlazes 1 and 2. Key: MicroBlaze, Block RAM, bus, queue.]

Justifying constraint optimization techniques

Our constraint optimization method can handle instances of the representative allocation and scheduling problem with up to 100’s of tasks onto 10’s of PEs

Implementation constraints can be easily incorporated
    Task groupings
    Multiprocessor topology restrictions
    Preferred allocations
    Memory assignments
    Mutual exclusion
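As an illustration of why such constraints are easy to add, each one can be written as a linear constraint over 0-1 allocation variables. The encodings below are common textbook forms and are assumptions, not the authors' constraints: x_{t,p} = 1 when task t runs on PE p, a_{v,m} = 1 when data item v is stored in memory m, N is the set of PE pairs with no direct link, and size_v and cap_m are the item size and memory capacity. Mutual exclusion is read here as "two tasks must not share a PE", which is one possible interpretation.

```latex
\begin{align}
\text{Task grouping (t and u together):}\quad
 & x_{t,p} = x_{u,p}
   && \forall p \in P \\
\text{Topology restriction for edge } (t,u)\text{:}\quad
 & x_{t,p} + x_{u,q} \le 1
   && \forall (p,q) \in N \\
\text{Preferred allocation (t on } p^{*}\text{):}\quad
 & x_{t,p^{*}} = 1 \\
\text{Memory assignment:}\quad
 & \sum_{m} a_{v,m} = 1
   && \forall v \\
 & \sum_{v} \mathit{size}_v\, a_{v,m} \le \mathit{cap}_m
   && \forall m \\
\text{Mutual exclusion (t and u apart):}\quad
 & x_{t,p} + x_{u,p} \le 1
   && \forall p \in P
\end{align}
```

Each rule is a handful of extra rows in the constraint matrix, so the rest of the formulation and the solver stay unchanged.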

Following Moore’s Law

Extend to more complex applications
    1000’s-10,000’s of tasks

Extend to bigger multiprocessor systems
    100’s-1000’s of PEs

[Figure: eight tiles, each with a PE, memory (M), and network interface (NI), connected by an on-chip network.]

What can we do for RAMP?

Challenges in deploying concurrent applications on a RAMP system
    Task allocation and scheduling across 100’s-1000’s of PEs
    Fast mapping step to enable efficient design space exploration

Our optimization techniques for static task allocation and scheduling are a first step to address these challenges
    A “compile-time” tool to guide the designer to explore efficient mappings
    Flexible formulation to target diverse multiprocessors

Research is in progress to extend our techniques to work on problems at the scale of RAMP systems

Backup Slides

Example

Optimal design found in less than 6 seconds on a 400 MHz Sparc II

[Figure: exploration from an application and an architecture (2VP50 with MicroBlazes, a PowerPC, BRAMs, and FSL and bus communication) to an optimal design.]

Following Moore’s Law

Extend to more complex applications
    1000’s-10,000’s of tasks
    DSLAM

Extend to bigger multiprocessor systems
    100’s-1000’s of PEs
    RAMP

Challenges in Automated Exploration

Higher exploration complexity
    Increases by 2 orders of magnitude

More emphasis on communication
    Arbitration modeling
    Routing constraints due to network topology

Statistical cost model for dynamic behavior

Potential Approaches to Address these Challenges

Additional constraints can be easily added to incorporate new features

Constraint solver performance will slow down and thus become the bottleneck

Some strategies to improve constraint solver performance
    Task graph based structural decompositions
    Relaxation heuristics
    Symmetry representation
    Cutting planes and valid inequalities

[Figure: on-chip network connecting eight tiles, each with a PE, memory (M), and network interface (NI). Key: PE, processor interconnect, memory, network interface.]
