Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation...

28
Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on a Parallel Embedded System by Jack M. West and John K. Antonio Texas Tech University {west, antonio}@ttu.edu EHPC ‘98: Workshop on Embedded HPC Systems and Applications April 3, 1998 Supported by Rome Laboratory Grant No. F30602-96-1-0098 and DARPA Contract No. F30602-97-2-0297

Transcript of Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation...

Page 1: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Simulation of the Communication Time for aSpace-Time Adaptive Processing Algorithm on a

on a Parallel Embedded System

byJack M. West and John K. Antonio

Texas Tech University{west, antonio}@ttu.edu

EHPC ‘98:Workshop on Embedded HPC Systems and Applications

April 3, 1998

Supported by Rome Laboratory Grant No. F30602-96-1-0098 and DARPA Contract No. F30602-97-2-0297

Page 2: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Outline

• STAP Basics

• Parallelization of STAP

• Impact of Mapping and Scheduling Choices on Communication Delay

• RACE Network Simulator

• Summary

Page 3: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Space-Time Adaptive Processing

• Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.

• STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).

• STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic counter measures and jamming).

Page 4: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Typical STAP Processing Flow

Pulses Pulses

Data Cube

Data Cube

Doppler Filter

Channels

Ran

ge

Ran

ge

Channels

Beamform

Beam Outputs

Ran

ge

Pulses

QR Decomposition

Rotate

Channels

Ran

ge

Pulses

Data Cube

Steering Vectors

Weights

Input Data

RotatePulse

Compress

Data CubeC

hann

els

Pulses

Range

Page 5: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Outline

• STAP Basics

• Parallelization of STAP

• Impact of Mapping and Scheduling Choices on Communication Delay

• RACE Network Simulator

• Summary

Page 6: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Sub-Cube Bar Partitioning Parallelization Technique

1. Partition STAP data cube over a 2-D process set.

2. Process the contiguous dimension.

3. Re-partition the data cube before processing the next dimension.

4. Rotate the newly distributed data to make the next dimension sequential in memory.

5. Repeat steps 1 through 4 before each processing phase.

Page 7: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Pulse Compression Partitioningwith range dimension whole.Pulse Compression Partitioningwith range dimension whole.

Illustration of STAP Data Cube Partitioning

Pulses Range

Cha

nnel

s

Cha

nnel

s

1 32 4

5 76 8

9 1110 12

Pulses

+

3 x 4 Process Set

Pulses

5

1

9

Range

Cha

nnel

s

Doppler Filtering Partitioningwith pulses dimension whole.Doppler Filtering Partitioningwith pulses dimension whole.

Pulses Range

Cha

nnel

s

9 10 11 12

5 6 7 8

1 2 3 4

Pulses Range

Cha

nnel

s

+

Cha

nnel

s

1 32 4

5 76 8

9 1110 12

Range

3 x 4 Process Set

Page 8: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

STAP Data Cube Re-Partitioning

Pulses

5

1

9

Range

Cha

nnel

s• Re-Partitioning involves exchanging data with the next whole dimension.

Cha

nnel

s

1 32 4

5 76 8

9 1110 12

Pulses

3 x 4 Process Set

Range Dimension is Contiguous

Cha

nnel

s

1 32 4

5 76 8

9 1110 12

Range

3 x 4 Process Set

Pulse Dimension is Contiguous

• Interprocessor Communication is required between processors in the same row.

Pulses

Range

Cha

nnel

s

9 10 11 12

5 6 7 8

1 1 1 2 1 3 1 4

Page 9: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

STAP Data Cube Re-PartitioningRange and Doppler Processing

Pulse CompressionPulse Compression

Re-PartitioningApply 3x4 Process Set:Re-PartitioningApply 3x4 Process Set:

Cha

nnel

s

1 32 4

5 76 8

9 1110 12

Range

Pulses

5

1

9

Range

Cha

nnel

s

Doppler FilteringDoppler Filtering

9 10 11 12

5 6 7 8

1 2 3 4

Range

Pulses

Cha

nnel

s

1

2

3

4

Ran

ge

Pulses

Chann

els

Rotation

Reference: M. F. Skalabrin and T. H. Einstein, “STAP Processing on a Multicomputer,” Proceedings of the Adaptive Sensor Array Workshop, March 1996.

Page 10: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

STAP Data Cube Re-PartitioningQR-D, Back-Substitution, Beamforming

QR-Decomposition &Back-Substitution

QR-Decomposition &Back-Substitution

Re-PartitioningApply 3x4 Process Set:Re-PartitioningApply 3x4 Process Set:

Dop

pler

1 32 4

5 76 8

9 1110 12

Range BeamformingBeamforming

1

2

3

4

Ran

geChannel

Dopple

r

RotationChannel

Ran

ge

Doppler

4 8 12

3 7 11

2 6 10

1 5 9

Doppler

Beam Outputs

Ran

ge

Page 11: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Outline

• STAP Basics

• Parallelization of STAP

• Impact of Mapping and Scheduling Choices on Communication Delay

• RACE Network Simulator

• Summary

Page 12: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

RACEway Interface

SHARC Compute Node Architecture

SHARCSHARC

SHARCSHARC

SHARCProcessorSHARC

ProcessorECC LogicECC Logic DRAMDRAM

PerformanceMetering

PerformanceMetering

DMAController

DMAController

3-WayData

Switch

3-WayData

Switch

RACEwayMapping

Logic

RACEwayMapping

Logic

OSSupport

Hardware

OSSupport

Hardware

CN ASIC

Page 13: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Data Set Re-Partitioningwith raster ordering in the channel dimension

63

912

52

811

41

710Channels

Pulses

Ran

ge

11

10

128

7

95

4

6

PulsesRange

Cha

nnel

s

2

1

3

Re-Partitioning

Required Data TransfersTotal Message Size Count

= 36 units

1

CN

CN

CN

33

3

10

CN

CN

CN

33

3

4 CN

CN

33

3

CN

7 CN

CN

33

3

CN

Page 14: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

PulsesRange

Cha

nnel

s 12

8

4

11

7

3

10

6

2

9

5

1

109

1112

65

7

8

Channels Pulses

Ran

ge

1

34

2Re-Partitioning

1

4CN

7

10

CN

CN

CN

CN

CN

3

4

3

3

4

3

Required Data TransfersTotal Message Size Count

= 20 units

Data Set Re-Partitioningwith raster ordering in the pulse dimension

Page 15: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Required Data TransfersRequired Data Transfers

STAP Data Cube Re-PartitioningData Re-Distribution Mapping

Network Interconnection ConfigurationNetwork Interconnection Configuration

6-PortCrossbar

CN CN CN CN

12

3

45

6 78

9

1011

12

IPC

56

78

910

1112

Cha

nnel

12

34Pulses Range

Pulse Compression

1

4CN

7

10

CN

CN

CN

CN

CN

3

4

3

3

4

3

Doppler Filtering

Pulses

Cha

nnel

Range

9 10 11 12

5 6 7 8

1 2 3 4

Page 16: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Outgoing Message Queues

CN CN CNCN

A (3)

E (4)

B (3) D (3)

F (3)

C (4)

6-PortCrossbar

Minimum Communication Time:

Tmin( , , , ) = Tc ( ) = Tc ( )

Tc ( ) = Toutgoing + T incoming = (3 + 4) + (3 + 4)

= 14 network cycles

CN CN CNCN CN CN

CN

Scheduling Communications

DA

6-PortCrossbar

Actual Communication Time

AT (0) = max[Tp ( ), Tp ( )]

= 3 network cycles (n.c.)

D6-Port

Crossbar

A (3) D (3)

B

6-PortCrossbar

T (3) = Tp ( ) = 3 n.c.B

B (3)

6-PortCrossbar

Tc = 3 network cycles6 network cycles

E

6-PortCrossbar

T (6) = Tp ( ) = 4 n.c.E

10 network cycles

6-PortCrossbar

E (4)

6-PortCrossbar

C

C (4)

6-PortCrossbar

14 network cycles

CT (10) = Tp ( ) = 4 n.c.

6-PortCrossbar

F

F (3)

6-PortCrossbar

17 network cycles

T (14) = Tp ( ) = 3 n.c.F

Page 17: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Actual Communication Time

Outgoing Message Queues

CN CN CNCN

6-PortCrossbar

A (3)

E (4)

B (3) D (3)F (3)

C (4)

A

F

6-PortCrossbar

Scheduling CommunicationsExchange Messages and in QueueCNCNC F

A FT (0) = max [Tp ( ), Tp ( )]

= 3 network cycles (n.c.)

Tc = 3 network cycles

6-PortCrossbar

A (3) F (3)

6-PortCrossbar

D

B T (3) = max [Tp ( ), Tp ( )]

= 3 n.c.

DB

6-PortCrossbar

B (3) D (3)A (3) F (3)

6 network cycles

6-PortCrossbar

E

T (6) = Tp ( ) = 4 n.c.E

10 network cycles

T (10) = Tp ( ) = 4 n.c.C

6-PortCrossbar

E (4)

6-PortCrossbar

C

14 network cyclesC (4)

6-PortCrossbar

Page 18: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Predicting Communication Time

• Predicting communication time at each phase depends on many factors:

– Number of CNs used.

– Size of the data cube.

– Topology of the interconnection network.

– Data decomposition scheme used (mapping).

– Scheduling of the communications.

Page 19: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Outline

• STAP Basics

• Parallelization of STAP

• Impact of Mapping and Scheduling Choices on Communication Delay

• RACE Network Simulator

• Summary

Page 20: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

RACE Network Interconnect Fat-Tree Topology

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

CNCN CNCNCNCN CNCN CNCN CNCNCNCN CNCN CNCN CNCNCNCN CNCNCNCN CNCN CNCN CNCN

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

CN

6-PortCrossbar

6-PortCrossbar

Message DestinationMessage DestinationMessage SourceMessage Source

MessagePath

MessagePath

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

6-PortCrossbar

CN

Page 21: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Some Features of the RACE Network

1. 40Mhz clock, 32 bit data paths, 2048 byte circuit-switched packets.

2. Contention resolved using priorities• User-programmable message priority

• Hardware priority assigned at each crossbar along a path (based on complex connection rules)

3. A message with higher priority preempts (suspends) a lower priority message (active or inactive) to gain control of a crossbar port

Reference: The RACE Multicomputer: Hardware Theory and Operation, Vol. 1, Mercury Computer Systems, Inc, 1995.

Page 22: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

SSIMULATORIMULATOR DDATAATA FFLOWLOW

SimulateMessageTraffic

SimulateMessageTraffic

BuildNetworkRoutingTable

BuildNetworkRoutingTable

Build -Order

MessageTraffic

Build -Order

MessageTraffic

• Network Size• STAP Data Cube

Parameters• Process Set

Configuration• Chaining Options• Message Ordering

• Network Size• STAP Data Cube

Parameters• Process Set

Configuration• Chaining Options• Message Ordering

User Interface

• Display the TimingResults

• Display the TimingResults

User Interface

BuildNetwork

BuildNetwork

Build STAPData Cube

Build STAPData Cube

BuildProcess

Set

BuildProcess

Set

Network Object

Page 23: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Crossbar ObjectCrossbar Object Crossbar ObjectCrossbar Object

Crossbar ObjectCrossbar Object

NNETWORKETWORK OOBJECTBJECT

Compute Node ObjectProcessor InformationMessage QueuesPacket Stack

Compute Node ObjectProcessor InformationMessage QueuesPacket Stack

Link ObjectLink Object

Random Scan ObjectGenerates Pseudo-Random CN Scan Ordering

Random Scan ObjectGenerates Pseudo-Random CN Scan Ordering

Simulation Clock ObjectBased on Network Clock FrequencyData Transfer Rate Equates to Effective Network Bandwidth

Simulation Clock ObjectBased on Network Clock FrequencyData Transfer Rate Equates to Effective Network Bandwidth

Dynamic Network ConstructionDynamic Routing Table CreationDynamic CN Message Traffic GenerationSimulates “Corner-Turn” Traffic

Dynamic Network ConstructionDynamic Routing Table CreationDynamic CN Message Traffic GenerationSimulates “Corner-Turn” Traffic

Network Object Methods

Page 24: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

CCROSSBARROSSBAR OOBJECTBJECT

Implements Hardware Priority Arbitration • TOP-LEVEL ALGORITHM• STANDARD ALGORITHM

Query Port StatusRoutes Packets to Next LocationAllocates and Frees Port Connections andConnected Link ObjectsTransmits Packet Data

Implements Hardware Priority Arbitration • TOP-LEVEL ALGORITHM• STANDARD ALGORITHM

Query Port StatusRoutes Packets to Next LocationAllocates and Frees Port Connections andConnected Link ObjectsTransmits Packet Data

Crossbar Object Methods

Link ObjectConnects Crossbars or Compute NodesLink Status: Occupied or Free

Link ObjectConnects Crossbars or Compute NodesLink Status: Occupied or Free

Crossbar ObjectTwo Parent Port ConnectionsFour Child Port ConnectionsInternal Switch ConnectionsTerminal Crossbars Contain CN Connections

Crossbar ObjectTwo Parent Port ConnectionsFour Child Port ConnectionsInternal Switch ConnectionsTerminal Crossbars Contain CN Connections

Page 25: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

CCOMPUTEOMPUTE NNODEODE OOBJECTBJECT

Manages Outgoing and Incoming MessageQueuesManages Packet StackExplodes the Top Outgoing Message into Packets of Size 2048 or LessHandles DMA Chaining of PacketsEstablishes Path Through Network and Transmits Packet Data

Manages Outgoing and Incoming MessageQueuesManages Packet StackExplodes the Top Outgoing Message into Packets of Size 2048 or LessHandles DMA Chaining of PacketsEstablishes Path Through Network and Transmits Packet Data

Compute Node Methods

Outgoing Message QueueOutgoing Message Queue

Message 1

Message 2

Message 3

::

Packet StackPacket StackEXPLODE

Compute Node ObjectProcessor InformationOutgoing and Incoming Message QueuesPacket Stack

• PACKETS ARE SELF-ROUTING

Compute Node ObjectProcessor InformationOutgoing and Incoming Message QueuesPacket Stack

• PACKETS ARE SELF-ROUTING

::

Packet 2Packet 3Packet 4

Packet 1

Page 26: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

SSIMULATIONIMULATION SSTATETATEDDIAGRAMIAGRAM

Message Queue Status

Message Queue Status

Packet Stack Status

Packet Stack Status

KilledKilledSuspendedSuspendedExitExit

ExplodeTop Msg

into Packets

ExplodeTop Msg

into Packets

ReadyReady

Start UpStart Up ActiveActiveBlockedBlocked

CompletedCompleted

EmptyNot Empty

Empty Not Empty

Not Empty

Legend• Control• First Pass• Second Pass

Legend• Control• First Pass• Second Pass

Page 27: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Outline

• STAP Basics

• Parallelization of STAP

• Impact of Mapping and Scheduling Choices on Communication Delay

• RACE Network Simulator

• Summary

Page 28: Simulation of the Communication Time for a Space-Time ...antonio/pubs/p-conf036.pdf · Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a on

Summary

• Designed and implemented network simulator that models the impact of data mapping and scheduling on the performance of a STAP

• Implemented in OO paradigm using JAVA (approximately 6000 LOC)

• Implementation of all network simulation objects completed. Currently testing and completing user interface objects.

• Can be used with any parallel application that has distinct and definable communication phases.

• Very fast execution time - Portable - Available on-line soon!