Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a Parallel Embedded System
by Jack M. West and John K. Antonio
Texas Tech University
{west, antonio}@ttu.edu
EHPC ’98: Workshop on Embedded HPC Systems and Applications
April 3, 1998
Supported by Rome Laboratory Grant No. F30602-96-1-0098 and DARPA Contract No. F30602-97-2-0297
Outline
• STAP Basics
• Parallelization of STAP
• Impact of Mapping and Scheduling Choices on Communication Delay
• RACE Network Simulator
• Summary
Space-Time Adaptive Processing
• Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.
• STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).
• STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic countermeasures and jamming).
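To make the "space and time combined" idea concrete, here is a minimal sketch, assuming the usual formulation in which the output for one range gate is the inner product of a weight set with the space-time snapshot. The function name `stap_output` and the sample values are illustrative and do not come from the slides.

```python
def stap_output(weights, snapshot):
    """Return sum over channels c and pulses p of conj(w[c][p]) * x[c][p]."""
    total = 0j
    for w_row, x_row in zip(weights, snapshot):
        for w, x in zip(w_row, x_row):
            total += w.conjugate() * x
    return total


# Hypothetical snapshot for one range gate: 2 channels (rows) x 3 pulses (columns).
snapshot = [
    [1 + 1j, 2 + 0j, 0 + 1j],
    [0 + 0j, 1 - 1j, 3 + 0j],
]

# Uniform weights reduce the filter to a plain sum. In a real STAP chain the
# weights would be adaptive (computed from the data and the steering vectors).
weights = [[1 + 0j] * 3 for _ in range(2)]

y = stap_output(weights, snapshot)
print(y)
```

The same weights multiply samples from every channel and every pulse at once, which is the sense in which the spatial and time domains are processed jointly.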
Typical STAP Processing Flow
[Figure: STAP processing flow. Labeled blocks: Input Data, Pulse Compress, Rotate, Doppler Filter, Rotate, QR Decomposition, Steering Vectors, Weights, Beamform, Beam Outputs. The data cube between stages has axes Channels, Pulses, and Range.]
Sub-Cube Bar Partitioning Parallelization Technique
1. Partition STAP data cube over a 2-D process set.
2. Process the contiguous dimension.
3. Re-partition the data cube before processing the next dimension.
4. Rotate the newly distributed data to make the next dimension sequential in memory.
5. Repeat steps 1 through 4 before each processing phase.
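The partitioning step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the cube sizes, the helper names `block_ranges` and `partition`, and the choice of near-equal contiguous blocks are all hypothetical. Processes are numbered 1 to 12 in raster order over a 3 x 4 process set.

```python
def block_ranges(length, parts):
    """Split range(length) into `parts` contiguous, near-equal blocks."""
    base, extra = divmod(length, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def partition(shape, whole_dim, rows, cols):
    """Map process id -> index ranges it owns in each dimension.

    `shape` maps dimension name -> length. The dimension `whole_dim` is kept
    whole on every process; the other two are split over rows and columns.
    """
    split_dims = [d for d in shape if d != whole_dim]
    row_blocks = block_ranges(shape[split_dims[0]], rows)
    col_blocks = block_ranges(shape[split_dims[1]], cols)
    layout = {}
    for r in range(rows):
        for c in range(cols):
            pid = r * cols + c + 1  # raster numbering, 1-based
            layout[pid] = {
                whole_dim: range(shape[whole_dim]),
                split_dims[0]: row_blocks[r],
                split_dims[1]: col_blocks[c],
            }
    return layout


# Hypothetical cube dimensions.
shape = {"channels": 12, "pulses": 16, "range": 64}
pulse_compression = partition(shape, "range", 3, 4)   # range dimension whole
doppler_filtering = partition(shape, "pulses", 3, 4)  # pulses dimension whole

print(pulse_compression[6])
print(doppler_filtering[6])
```

Calling `partition` again with a different `whole_dim` corresponds to step 3; the rotation in step 4 (making the new whole dimension sequential in memory) is a local transpose and is not modeled here.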
Illustration of STAP Data Cube Partitioning
[Figure: a 3 x 4 process set (processes numbered 1 to 12) applied to the STAP data cube, whose axes are Channels, Pulses, and Range. One panel shows Pulse Compression Partitioning with the range dimension whole; the other shows Doppler Filtering Partitioning with the pulses dimension whole.]
STAP Data Cube Re-Partitioning
• Re-Partitioning involves exchanging data with the next whole dimension.
• Interprocessor communication is required between processors in the same row.
[Figure: the data cube (axes Channels, Pulses, Range) partitioned by the 3 x 4 process set, shown once with the range dimension contiguous and once with the pulse dimension contiguous.]
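Why the exchange stays within a row can be seen from a small sketch. It assumes, hypothetically, that rows of the process set split the channel dimension in both layouts, that columns split pulses before the re-partition and range after it, and that every dimension divides evenly. The function name and cube sizes are illustrative.

```python
ROWS, COLS = 3, 4  # the 3 x 4 process set from the slides


def repartition_messages(channels, pulses, range_gates):
    """List (src, dst, size) transfers for range-whole -> pulses-whole.

    Process (r, c) initially owns channel block r, pulse block c, all range.
    Afterwards it owns channel block r, all pulses, range block c. The piece
    with pulse block c_src and range block c_dst therefore moves from
    (r, c_src) to (r, c_dst): same row, different column.
    """
    ch_blk = channels // ROWS
    pu_blk = pulses // COLS
    rg_blk = range_gates // COLS
    messages = []
    for r in range(ROWS):
        for c_src in range(COLS):
            for c_dst in range(COLS):
                if c_src == c_dst:
                    continue  # this piece is already on the right process
                src = r * COLS + c_src + 1
                dst = r * COLS + c_dst + 1
                messages.append((src, dst, ch_blk * pu_blk * rg_blk))
    return messages


msgs = repartition_messages(channels=12, pulses=16, range_gates=64)
for src, dst, size in msgs[:3]:
    print(f"process {src} -> process {dst}: {size} samples")
```

Because the channel block owned by a process does not change, no message ever crosses rows; only the column index differs between sender and receiver.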
STAP Data Cube Re-Partitioning: Range and Doppler Processing
[Figure: Pulse Compression, then Re-Partitioning (apply the 3 x 4 process set), then Rotation, then Doppler Filtering. Data cube axes: Channels, Pulses, Range.]
Reference: M. F. Skalabrin and T. H. Einstein, “STAP Processing on a Multicomputer,” Proceedings of the Adaptive Sensor Array Workshop, March 1996.
STAP Data Cube Re-Partitioning: QR-D, Back-Substitution, Beamforming
[Figure: QR-Decomposition & Back-Substitution, then Re-Partitioning (apply the 3 x 4 process set), then Rotation, then Beamforming, which produces the Beam Outputs. Data cube axes: Channel, Doppler, Range.]
SHARC Compute Node Architecture
[Figure: block diagram of a compute node. Labeled blocks: SHARC Processors, ECC Logic, DRAM, CN ASIC, Performance Metering, DMA Controller, 3-Way Data Switch, RACEway Mapping Logic, OS Support Hardware, RACEway Interface.]
Data Set Re-Partitioning with raster ordering in the channel dimension
[Figure: processes 1 to 12 assigned to the data cube (axes Channels, Pulses, Range) before and after re-partitioning.]
Required Data Transfers: Total Message Size Count = 36 units
[Figure: four groups of transfers, one for each of the CNs labeled 1, 4, 7, and 10; each group has three messages of 3 units.]
Data Set Re-Partitioning with raster ordering in the pulse dimension
[Figure: processes 1 to 12 assigned to the data cube (axes Channels, Pulses, Range) before and after re-partitioning.]
Required Data Transfers: Total Message Size Count = 20 units
[Figure: six messages of 3, 4, 3, 3, 4, and 3 units among the CNs; the labeled CNs are 1, 4, 7, and 10.]
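The two totals can be checked with a few lines of arithmetic. The message sizes below are read from the transfer diagrams (three 3-unit messages for each of the four labeled CNs in the channel-ordered case, and six messages of 3, 4, 3, 3, 4, 3 units in the pulse-ordered case); the variable and function names are illustrative.

```python
# Channel-dimension raster ordering: one group of messages per labeled CN.
channel_raster = {1: [3, 3, 3], 4: [3, 3, 3], 7: [3, 3, 3], 10: [3, 3, 3]}

# Pulse-dimension raster ordering: sizes of the six messages.
pulse_raster = [3, 4, 3, 3, 4, 3]


def total_units(message_sizes):
    """Total message size count for a flat list of message sizes."""
    return sum(message_sizes)


channel_total = total_units([s for group in channel_raster.values() for s in group])
pulse_total = total_units(pulse_raster)
saving = 1 - pulse_total / channel_total

print(channel_total, pulse_total, f"{saving:.0%} less data moved")
```

The comparison only counts data volume; how long the transfers take also depends on how they are scheduled through the network.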
STAP Data Cube Re-Partitioning: Data Re-Distribution Mapping
[Figure: Network Interconnection Configuration: CNs attached to a 6-port crossbar, with processes 1 to 12 placed on the CNs. The data cube mapping is shown for Pulse Compression and for Doppler Filtering. The required interprocessor communication (IPC) between the two consists of six messages of 3, 4, 3, 3, 4, and 3 units.]
Scheduling Communications
Outgoing Message Queues (message sizes in network cycles): A (3), B (3), C (4), D (3), E (4), F (3). The queues belong to CNs attached to one 6-port crossbar.
Minimum Communication Time (the CN icons used as arguments in the original did not survive extraction):
$T_{min} = T_c(\mathrm{CN}) = T_{outgoing} + T_{incoming} = (3 + 4) + (3 + 4) = 14$ network cycles
Actual Communication Time:
• $T(0) = \max[T_p(A), T_p(D)] = 3$ network cycles (n.c.); running total: 3 network cycles
• $T(3) = T_p(B) = 3$ n.c.; running total: 6 network cycles
• $T(6) = T_p(E) = 4$ n.c.; running total: 10 network cycles
• $T(10) = T_p(C) = 4$ n.c.; running total: 14 network cycles
• $T(14) = T_p(F) = 3$ n.c.; running total: 17 network cycles
Scheduling Communications: Exchange Messages C and F in Queue
Outgoing Message Queues (message sizes in network cycles): A (3), B (3), C (4), D (3), E (4), F (3), with C and F exchanged in their CN's queue.
Actual Communication Time:
• $T(0) = \max[T_p(A), T_p(F)] = 3$ network cycles (n.c.); running total: 3 network cycles
• $T(3) = \max[T_p(B), T_p(D)] = 3$ n.c.; running total: 6 network cycles
• $T(6) = T_p(E) = 4$ n.c.; running total: 10 network cycles
• $T(10) = T_p(C) = 4$ n.c.; running total: 14 network cycles
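The timing of both schedules can be reproduced with a short sketch. It takes the groups of concurrent messages exactly as the slides show them and assumes each group lasts as long as its longest message, with groups running back to back. The function names are illustrative; the simulator itself decides the groups from network contention, which is not modeled here.

```python
# Message sizes in network cycles, from the outgoing message queues.
SIZES = {"A": 3, "B": 3, "C": 4, "D": 3, "E": 4, "F": 3}


def schedule_time(rounds, sizes=SIZES):
    """Total time when each round lasts as long as its longest message."""
    return sum(max(sizes[m] for m in group) for group in rounds)


def node_lower_bound(outgoing, incoming):
    """T_c for one CN: T_outgoing + T_incoming, all transfers serialized."""
    return sum(outgoing) + sum(incoming)


original_order = [["A", "D"], ["B"], ["E"], ["C"], ["F"]]
c_and_f_exchanged = [["A", "F"], ["B", "D"], ["E"], ["C"]]

t_min = node_lower_bound(outgoing=[3, 4], incoming=[3, 4])
t_original = schedule_time(original_order)
t_exchanged = schedule_time(c_and_f_exchanged)

print(t_min, t_original, t_exchanged)
```

Exchanging two messages in one queue lets F travel alongside A and D travel alongside B, which removes the final 3-cycle step and brings the actual time down to the minimum.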
Predicting Communication Time
• Predicting communication time at each phase depends on many factors:
– Number of CNs used.
– Size of the data cube.
– Topology of the interconnection network.
– Data decomposition scheme used (mapping).
– Scheduling of the communications.
RACE Network Interconnect Fat-Tree Topology
[Figure: fat-tree of 6-port crossbars with compute nodes (CNs) at the leaves. A second view marks a Message Source CN, a Message Destination CN, and the Message Path through the crossbars between them.]
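The length of a message path in such a tree can be sketched as follows, assuming each crossbar has four child ports (as the Crossbar Object description in this deck states) and that CNs are numbered 0, 1, 2, ... from left to right. The numbering and the function name are hypothetical; which of the two parent ports a message takes on the way up is a routing decision that this sketch does not model.

```python
def crossbars_on_path(src, dst, fanout=4):
    """Number of crossbars a message crosses between two distinct leaf CNs.

    The message climbs until it reaches a crossbar level that is a common
    ancestor of both CNs, then descends: `levels` crossbars up, one shared,
    so 2 * levels - 1 crossbars in total.
    """
    if src == dst:
        raise ValueError("source and destination must differ")
    levels = 0
    while src != dst:
        src //= fanout
        dst //= fanout
        levels += 1
    return 2 * levels - 1


print(crossbars_on_path(0, 1))   # CNs on the same leaf crossbar
print(crossbars_on_path(0, 4))   # CNs on neighboring leaf crossbars
print(crossbars_on_path(0, 16))  # CNs that meet one level higher
```

Longer paths hold more crossbar ports at once, so they are more exposed to contention from other messages.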
Some Features of the RACE Network
1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.
2. Contention is resolved using priorities:
   • User-programmable message priority
   • Hardware priority assigned at each crossbar along a path (based on complex connection rules)
3. A message with higher priority preempts (suspends) a lower-priority message (active or inactive) to gain control of a crossbar port.
Reference: The RACE Multicomputer: Hardware Theory and Operation, Vol. 1, Mercury Computer Systems, Inc, 1995.
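The clock rate, path width, and packet size listed above translate into rough timing figures. The sketch below assumes one 32-bit word moves per clock cycle and ignores path setup, arbitration, and preemption, so the results are peak values rather than measured ones. The constant and function names are illustrative.

```python
CLOCK_HZ = 40_000_000     # 40 MHz network clock
BYTES_PER_CYCLE = 4       # 32-bit data path, assumed one word per cycle
MAX_PACKET_BYTES = 2048   # maximum packet size


def peak_bandwidth_bytes_per_s():
    """Peak link bandwidth under the one-word-per-cycle assumption."""
    return CLOCK_HZ * BYTES_PER_CYCLE


def packet_cycles(nbytes):
    """Clock cycles to move nbytes, rounding partial words up."""
    return -(-nbytes // BYTES_PER_CYCLE)


def packet_time_us(nbytes):
    """Transfer time in microseconds, excluding any setup overhead."""
    return packet_cycles(nbytes) / CLOCK_HZ * 1e6


print(peak_bandwidth_bytes_per_s() / 1e6, "MB/s peak")
print(packet_cycles(MAX_PACKET_BYTES), "cycles per full packet")
print(packet_time_us(MAX_PACKET_BYTES), "microseconds per full packet")
```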
Simulator Data Flow
[Figure: data-flow diagram of the simulator.]
• User Interface (input): Network Size, STAP Data Cube Parameters, Process Set Configuration, Chaining Options, Message Ordering
• Processing blocks: Build Network, Build STAP Data Cube, Build Process Set, Build Network Routing Table, Build -Order Message Traffic, Simulate Message Traffic
• User Interface (output): Display the Timing Results
Network Object
• Contains Crossbar Objects, Link Objects, and Compute Node Objects (Processor Information, Message Queues, Packet Stack)
• Random Scan Object: generates pseudo-random CN scan ordering
• Simulation Clock Object: based on the network clock frequency; the data transfer rate equates to the effective network bandwidth
Network Object Methods
• Dynamic Network Construction
• Dynamic Routing Table Creation
• Dynamic CN Message Traffic Generation
• Simulates "Corner-Turn" Traffic
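A pseudo-random scan ordering of the kind attributed to the Random Scan Object can be sketched in a few lines. This is an illustration, not the simulator's Java code: the class name `RandomScan`, the 1-based CN ids, and the optional seed are assumptions.

```python
import random


class RandomScan:
    """Produces a fresh pseudo-random visiting order of CN ids on each call."""

    def __init__(self, num_cns, seed=None):
        self._ids = list(range(1, num_cns + 1))
        # A private generator keeps runs reproducible when a seed is given.
        self._rng = random.Random(seed)

    def next_order(self):
        order = self._ids[:]       # copy so the base list stays sorted
        self._rng.shuffle(order)   # in-place shuffle of the copy
        return order


scan = RandomScan(num_cns=12, seed=1998)
print(scan.next_order())
print(scan.next_order())
```

Visiting the CNs in a different order on each simulation step avoids always favoring the same node when several of them compete for the same crossbar port.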
Crossbar Object
• Two Parent Port Connections
• Four Child Port Connections
• Internal Switch Connections
• Terminal Crossbars Contain CN Connections
Link Object
• Connects Crossbars or Compute Nodes
• Link Status: Occupied or Free
Crossbar Object Methods
• Implements Hardware Priority Arbitration
   – TOP-LEVEL ALGORITHM
   – STANDARD ALGORITHM
• Query Port Status
• Routes Packets to Next Location
• Allocates and Frees Port Connections and Connected Link Objects
• Transmits Packet Data
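The preemption rule (a higher-priority message suspends a lower-priority one to gain a crossbar port) can be illustrated with a toy port model. Everything here is a simplifying assumption: a single integer priority where larger means higher, no preemption on ties, and resumption of the highest-priority suspended message on release. The actual hardware priority rules are described in the slides only as complex connection rules and are not reproduced.

```python
class Port:
    """One crossbar port that carries at most one active message."""

    def __init__(self):
        self.owner = None     # (priority, message id) or None when free
        self.suspended = []   # preempted messages waiting to resume

    def request(self, priority, msg_id):
        """Try to gain the port; return True on success."""
        if self.owner is None:
            self.owner = (priority, msg_id)
            return True
        if priority > self.owner[0]:
            # Higher priority wins: suspend the current owner.
            self.suspended.append(self.owner)
            self.owner = (priority, msg_id)
            return True
        return False  # equal or lower priority is blocked

    def release(self):
        """Free the port, resuming the best suspended message if any."""
        if self.suspended:
            self.suspended.sort()
            self.owner = self.suspended.pop()  # highest priority last
        else:
            self.owner = None


port = Port()
print(port.request(1, "m1"))  # port was free
print(port.request(1, "m2"))  # same priority, blocked
print(port.request(5, "m3"))  # preempts m1
port.release()
print(port.owner)             # m1 resumes
```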
Compute Node Object
• Processor Information
• Outgoing and Incoming Message Queues
• Packet Stack
• PACKETS ARE SELF-ROUTING
Compute Node Methods
• Manages Outgoing and Incoming Message Queues
• Manages Packet Stack
• Explodes the Top Outgoing Message into Packets of Size 2048 or Less
• Handles DMA Chaining of Packets
• Establishes Path Through Network and Transmits Packet Data
[Figure: the Outgoing Message Queue (Message 1, Message 2, Message 3, ...) feeds the Packet Stack (Packet 1, Packet 2, Packet 3, Packet 4, ...) through an EXPLODE step.]
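The "explode" step, which splits the top outgoing message into packets of 2048 bytes or less, can be sketched as follows. The function name and the choice to return packet sizes in transmission order are illustrative assumptions; DMA chaining and the stack discipline are not modeled.

```python
MAX_PACKET_BYTES = 2048


def explode(message_bytes, max_packet=MAX_PACKET_BYTES):
    """Return packet sizes for one message, in transmission order.

    Every packet is full except possibly the last one.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    full, rest = divmod(message_bytes, max_packet)
    packets = [max_packet] * full
    if rest:
        packets.append(rest)
    return packets


print(explode(5000))
print(explode(2048))
```

A 5000-byte message, for example, becomes two full packets and one partial packet, each of which then has to establish its own path through the network.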
Simulation State Diagram
[Figure: state diagram. States: Start Up, Ready, Active, Blocked, Suspended, Killed, Completed, Exit. Decision points: Message Queue Status (Empty / Not Empty) and Packet Stack Status (Empty / Not Empty). Action: Explode Top Msg into Packets. Legend: Control, First Pass, Second Pass.]
Summary
• Designed and implemented a network simulator that models the impact of data mapping and scheduling on the performance of a STAP algorithm.
• Implemented in an object-oriented (OO) paradigm using Java (approximately 6000 lines of code).
• Implementation of all network simulation objects completed. Currently testing and completing user interface objects.
• Can be used with any parallel application that has distinct and definable communication phases.
• Very fast execution time - Portable - Available on-line soon!