Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP
by
John K. Antonio
Department of Computer Science
College of Engineering
Texas Tech University
First Annual Review
June 23, 1998
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP
Texas Tech University: John K. Antonio
Applications
• SAR
• STAP
Requirements
• Throughput
• SWAP
• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration
New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic “tuning” of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system
Schedule (Jun 97 start, Jun 98, Jun 99, Dec 99 end)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies
Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications
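The "New Ideas" bullets describe choosing the number and types of cards so that SWAP is minimal while a throughput requirement is met. The slides name mathematical programming techniques but do not give a formulation, so the following is only a minimal sketch that substitutes an exhaustive search over card counts. The card catalog, its throughput and SWAP figures, and the function name `minimal_swap_config` are hypothetical placeholders, not values from the program.

```python
from itertools import product

# Hypothetical card catalog: throughput and SWAP cost per card, in arbitrary units.
# These numbers are illustrative assumptions, not measured values.
CARDS = {
    "GPP": {"throughput": 2.0, "swap": 3.0},
    "DSP": {"throughput": 3.0, "swap": 4.0},
    "FPGA": {"throughput": 5.0, "swap": 6.0},
}


def minimal_swap_config(required_throughput, max_cards_per_type=4):
    """Search all card counts and return (swap, counts) for the cheapest feasible one.

    Returns None when no combination meets the throughput requirement.
    """
    names = list(CARDS)
    best = None
    for counts in product(range(max_cards_per_type + 1), repeat=len(names)):
        throughput = sum(n * CARDS[k]["throughput"] for n, k in zip(counts, names))
        swap = sum(n * CARDS[k]["swap"] for n, k in zip(counts, names))
        feasible = throughput >= required_throughput
        if feasible and (best is None or swap < best[0]):
            best = (swap, dict(zip(names, counts)))
    return best


result = minimal_swap_config(required_throughput=12.0)
print(result)
```

Exhaustive search is only practical for a handful of card types; an integer programming solver would be the natural replacement once the design space includes data mapping and place-and-route parameters.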
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Personnel (Program Management Status)
• John K. Antonio, Principal Investigator
• Ph.D., EE, Texas A&M Univ. (1989)
• Currently Assoc. Prof. of CS, Texas Tech Univ.
• Over 65 publications in HPC and related areas
• PI or co-PI of 17 contracts/grants totaling over $2.1M
• Jeff Muehring, Research Assistant, Ph.D. student
  Optimal GPP/DSP/FPGA Configuration Techniques for SAR
  Intern at IBM/Houston, 1/98 to 6/98
• Jack West, Research Assistant, Ph.D. student
  Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator
• Nikhil Gupta, Research Assistant, M.S. student
  Algorithms for STAP Weight Calculation; Mapping Inner Product Computations onto FPGAs
  Graduating July 1998
• Tim Osmulski, Research Assistant, M.S. student
  Power Prediction Simulator for FPGAs
  Graduated May 1998
• Brian Veale, Research Assistant, M.S. student
  Calibration of FPGA power prediction model; Implementation of STAP core on GPP/FPGA
  New RA as of May 1998
• New Student, Research Assistant, M.S. student
  Implementation of SAR core on GPP/FPGA
  To be hired September 1998
Contacts, Partners, Vendors, and Other Communications (Program Management Status)
• DARPA: José Muñoz
• Rome Lab: Ralph Kohler
• MIT Lincoln Lab: David Martinez, Jim Ward
• MITRE: Richard Games
• Northrop Grumman: Marc Campbell
• Synplicity, Inc.: Madelyn Miller
• Xilinx: Jason Feinsmith
• Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
• ISI: Milissa Benincasa, David Coker
• Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms
Equipment Status (Program Management Status)
Mercury
• 20 Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• 9U VME RACE Board
• SHARC Daughtercard (2 CNs, 8 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• SHARC Daughtercard (2 CNs, 16 MB/CN)
• MC/OS, Cross Assembler, Toolkit
• PowerPC Daughtercard (2 CNs, 16 MB/CN)
Annapolis Micro Systems
• PCI WILDONE Card (1 Xilinx 4028EX-3)
• VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)
Other Vendors
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)
√ (nine check marks appear beside items on the slide)
Schedule of Milestones
Timeline: June 1997, Dec. 1997, June 1998, Dec. 1998, June 1999, Dec. 1999
• Design STAP Iterative Weight Solver for FPGA
• Inter-GPP/DSP Comm. Simulator for STAP
• Optimal GPP/DSP Config. for SAR
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• Implement STAP Iterative Weight Solver on FPGA
• Optimal GPP/DSP Config. for STAP
• Implement SAR Linear Filtering on FPGA
• Optimal GPP/DSP/FPGA Config. for SAR/STAP
• GPP/DSP and FPGA Subsystem Integration and Testing
• Optimal GPP/DSP/FPGA Config. for SAR
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
• Implement SAR on GPP/DSP
• Design SAR Linear Filtering for FPGA
• Implement STAP on GPP/DSP
• Implement SAR on GPP/DSP/FPGA Platform
• Optimal GPP/DSP/FPGA Config. for STAP
• Implement STAP on GPP/DSP/FPGA Platform
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator
Key: tasks are marked as Research/Design or Implement/Test for each of the GPP/DSP Sub-System, the FPGA Sub-System, and the GPP/DSP/FPGA System.
FY 97 and FY 98 Budgets (Program Management Status)
Category         FY 97 Approved   FY 98 Approved   FY 98 Required*   FY 98 “Deficit”
Personnel                22,066           56,710            84,517            27,807
Fringes                   7,575           18,871            25,723             6,852
Consulting                    0                0            15,000            15,000
Expenses                    260            3,321             4,500             1,179
Travel                        0            4,500             4,500                 0
Equipment                74,000           55,608            85,088            29,480
Indirect Cost            13,634           39,198            62,623            23,425
Total                   116,644          178,208           281,951           103,743
*Required to maintain 30 month completion date (i.e., 12/31/99).
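The FY 98 “Deficit” column is the difference between the Required and Approved columns. The short check below recomputes it from the figures in the budget table; the dictionary names are chosen here for illustration only.

```python
# FY 98 figures copied from the budget table (US dollars).
approved = {
    "Personnel": 56710, "Fringes": 18871, "Consulting": 0, "Expenses": 3321,
    "Travel": 4500, "Equipment": 55608, "Indirect Cost": 39198,
}
required = {
    "Personnel": 84517, "Fringes": 25723, "Consulting": 15000, "Expenses": 4500,
    "Travel": 4500, "Equipment": 85088, "Indirect Cost": 62623,
}

# Deficit per category is Required minus Approved.
deficit = {category: required[category] - approved[category] for category in approved}
total_deficit = sum(deficit.values())

for category, amount in deficit.items():
    print(f"{category:<14}{amount:>8,}")
print(f"{'Total':<14}{total_deficit:>8,}")
```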
FY 99 and FY 00 Budgets (Program Management Status)
Category           FY 99     FY 00   Project Total
Personnel        138,536    52,401         297,520
Fringes           39,911    14,404          87,614
Consulting        25,000    10,000          50,000
Expenses           7,078     3,000          14,838
Travel            12,000     5,000          20,500
Equipment         59,892         0         217,670
Indirect Cost    104,587    39,858         221,121
Total            387,004   124,664         909,262
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
Related STAP and RACE References
J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.
M. F. Skalabrin and T. H. Einstein, “STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication,” Proc. Adaptive Sensor Array Processing (ASAP) Workshop, March 1996.
The RACE Multicomputer, Hardware Theory of Operation: Processors, I/O Interface, and the RACEway Interconnect, Volume I, ver. 1.3.
T. H. Einstein, “Mercury Computer Systems’ Modular Heterogeneous RACE Multicomputer,” Proc. 6th Heterogeneous Comp. Workshop, April 1997, pp. 60-71.
B. C. Kuszmaul, “The RACE Network Architecture,” Proc. 9th Int’l Parallel Processing Symp., April 1995, pp. 508-513.
G. Booch, I. Jacobson, and J. Rumbaugh, “The Unified Modeling Language for Object Oriented Development,” Documentation Set Version 1.1, September 1997.
SPACE-TIME ADAPTIVE PROCESSING
1. Space-Time Adaptive Processing (STAP) refers to a class of signal processing methods that operate on data collected from a set of sensors over a given time interval.
2. STAP simultaneously combines the signals received from an antenna array (spatial domain) and multiple pulse repetition periods (time domain).
3. STAP provides improved detection of smaller targets in the presence of ground clutter (overland and littoral environments) and hostile interference (electronic countermeasures and jamming).
STAP PROCESSING FLOW
[Figure: STAP processing flow. Block labels: Input Data, Pulse Compress, Rotate, Doppler Filter, Rotate, QR Decomposition, Steering Vectors, Weights, Beamform, Beam Outputs. The data cube passed between blocks has axes Channels, Pulses, and Range.]
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Mercury RACE Multicomputer
• Space-Time Adaptive Processing (STAP) Basics
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
SOME RACE NETWORK FEATURES
1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.
2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)
3. A packet with higher priority preempts (suspends) a lower priority packet (active or inactive) to gain control of a crossbar port.
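The clock rate, data path width, and packet size listed for the RACE network imply an upper bound on link bandwidth. The sketch below works through that arithmetic under the assumption (not stated on the slide) that one 32-bit word moves per clock cycle with no routing, arbitration, or header overhead, so the results are idealized peaks rather than measured figures. The function names are illustrative.

```python
CLOCK_HZ = 40_000_000      # 40 MHz network clock (from the slide)
DATA_PATH_BITS = 32        # 32-bit data paths (from the slide)
MAX_PACKET_BYTES = 2048    # maximum packet size (from the slide)


def peak_bandwidth_bytes_per_s(clock_hz=CLOCK_HZ, width_bits=DATA_PATH_BITS):
    """Peak link bandwidth, assuming one full-width word per clock cycle."""
    return clock_hz * width_bits // 8


def ideal_packet_time_us(packet_bytes=MAX_PACKET_BYTES):
    """Time to stream one packet at peak bandwidth, in microseconds."""
    return packet_bytes / peak_bandwidth_bytes_per_s() * 1e6


print(peak_bandwidth_bytes_per_s())        # bytes per second
print(round(ideal_packet_time_us(), 3))    # microseconds for a 2048-byte packet
```

Under these assumptions the peak is 160 MB/s per link, and a full 2048-byte packet occupies the path for about 12.8 microseconds.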
RACE NETWORK INTERCONNECT: FAT-TREE TOPOLOGY
[Figure: fat tree built from 6-port crossbars with compute nodes (CNs) at the leaves. A message path is traced through the crossbars from the message source CN to the message destination CN.]
STANDARD CROSSBAR PRIORITY ARBITRATION ALGORITHM TABLE
Transaction status columns: (1) Active; (2) Not Yet Active, Port E Involved; (3) Not Yet Active, Port E Not Involved.
Hardware Priority   (1) Entry Port   (1) Exit Port   (2) Entry Port   (2) Exit Port   (3) Entry Port   (3) Exit Port
7                   F                A,B,C,D,E       F                A,B,C,D,E       F                A,B,C,D
6                   E                F               E                F               A,B,C,D*         A,B,C,D*
5                   A,B,C,D          F               A,B,C,D          F               A,B,C,D          F
4                   E                A,B,C,D         E                A,B,C,D         -                -
3                   *A,B,C,D         *A,B,C,D,E      A,B,C,D*         A,B,C,D*        -                -
2                   -                -               A,B,C,D          E               -                -
1                   -                -               -                -               -                -
* - Peer Kill Rules Apply
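The RACE feature list states that a higher-priority packet preempts (suspends) a lower-priority packet to gain control of a crossbar port. The sketch below models only that single rule for one port. It does not implement the arbitration table or the peer kill rules, and the class names, the tie-handling choice (equal priority does not preempt), and the state strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Packet:
    name: str
    priority: int            # hardware priority, 1 (lowest) to 7 (highest)
    state: str = "ready"


@dataclass
class Port:
    owner: Optional[Packet] = None
    suspended: List[Packet] = field(default_factory=list)

    def request(self, packet: Packet) -> bool:
        """Try to give the port to `packet`; return True if it now owns the port."""
        if self.owner is None:
            self.owner = packet
            packet.state = "active"
            return True
        if packet.priority > self.owner.priority:
            # Higher priority preempts: the current owner is suspended.
            self.owner.state = "suspended"
            self.suspended.append(self.owner)
            self.owner = packet
            packet.state = "active"
            return True
        # Equal or lower priority: the requester waits (assumed tie rule).
        packet.state = "blocked"
        return False


port = Port()
low, high, mid = Packet("low", 3), Packet("high", 5), Packet("mid", 4)
port.request(low)    # port is free, so "low" becomes active
port.request(high)   # "high" preempts "low"
port.request(mid)    # "mid" cannot preempt "high"
states = (low.state, high.state, mid.state)
print(states)
```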
SHARC COMPUTE NODE
[Figure: block diagram of a SHARC compute node. Labels: CN ASIC, RACEway Interface, RACEway Mapping Logic, 3-Way Data Switch, DMA Controller, ECC Logic, Performance Metering, OS Support Hardware, SHARC Processor, DRAM.]
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Parallelization Approach for STAP
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
SUB-CUBE BAR PARTITIONING METHODOLOGY
1. Partition STAP data cube over a 2-D process set.
2. Process the contiguous dimension.
3. Re-partition the data cube before processing the next dimension.
4. Rotate the newly distributed data to make the next dimension sequential in memory.
5. Repeat steps 1 through 4 before each processing phase.
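Step 1 of the sub-cube bar methodology can be sketched as follows: one dimension of the data cube is kept whole and the other two are split across the rows and columns of the 2-D process set. This is a minimal illustration that assumes a contiguous block distribution with the channels dimension always split over rows; the slides do not specify the exact distribution, and the function names are hypothetical.

```python
def block_ranges(n, parts):
    """Split range(n) into `parts` contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def sub_cube_bars(dims, whole, rows, cols):
    """Map each process (row, col) to its half-open index range in every dimension.

    dims:  sizes keyed by "channels", "pulses", "range"
    whole: the dimension kept whole ("pulses" or "range"); channels are split
           over rows and the remaining dimension over columns (an assumption).
    """
    split = [d for d in dims if d not in ("channels", whole)][0]
    channel_blocks = block_ranges(dims["channels"], rows)
    split_blocks = block_ranges(dims[split], cols)
    return {
        (r, c): {
            "channels": channel_blocks[r],
            split: split_blocks[c],
            whole: (0, dims[whole]),
        }
        for r in range(rows)
        for c in range(cols)
    }


cube = {"channels": 16, "pulses": 22, "range": 200}
pulse_compression = sub_cube_bars(cube, whole="range", rows=3, cols=4)
doppler_filtering = sub_cube_bars(cube, whole="pulses", rows=3, cols=4)
print(pulse_compression[(0, 0)])
print(doppler_filtering[(2, 3)])
```

Each process holds a "bar" that spans the whole dimension, so the processing of that dimension (step 2) needs no interprocessor communication.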
STAP DATA CUBE PARTITIONING EXAMPLES
[Figure: a data cube with axes Pulses, Range, and Channels is distributed over a 3 x 4 process set (processes 1 to 12). Panel 1: pulse compression partitioning with the range dimension whole. Panel 2: Doppler filtering partitioning with the pulses dimension whole.]
STAP DATA CUBE REPARTITIONING
• Re-partitioning involves exchanging data with the next whole dimension.
• Interprocessor communication is required between processors in the same row.
[Figure: the data cube on a 3 x 4 process set (processes 1 to 12), shown first with the range dimension contiguous and then with the pulse dimension contiguous.]
STAP DATA CUBE REPARTITIONING: Data Re-distribution Mapping
[Figure: two panels, "Required Data Transfers" and "Network Interconnection Configuration". The partitionings for pulse compression and Doppler filtering (processes 1 to 12) are shown with the interprocessor communication (IPC) between them, alongside CNs attached to 6-port crossbars.]
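The data transfers required when the cube is re-partitioned from range-whole (pulse compression) to pulses-whole (Doppler filtering) can be enumerated directly. The sketch below assumes a contiguous block distribution and counts data elements rather than bytes; both are assumptions, as is the function name. It reflects the point that messages are exchanged only between processes in the same row of the process set.

```python
def block_sizes(n, parts):
    """Sizes of `parts` contiguous blocks of n items, differing by at most 1."""
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def repartition_messages(channels, pulses, ranges, rows, cols):
    """Element counts sent from process (r, c) to process (r, c2), c2 != c.

    Before: (r, c) holds channel block r, pulse block c, all range gates.
    After:  (r, c2) holds channel block r, all pulses, range block c2.
    The data sent is the intersection of the two blocks.
    """
    ch = block_sizes(channels, rows)
    pu = block_sizes(pulses, cols)
    rg = block_sizes(ranges, cols)
    messages = {}
    for r in range(rows):
        for c in range(cols):
            for c2 in range(cols):
                if c2 != c:
                    messages[((r, c), (r, c2))] = ch[r] * pu[c] * rg[c2]
    return messages


msgs = repartition_messages(channels=16, pulses=22, ranges=200, rows=3, cols=4)
print(len(msgs))                   # number of point-to-point messages
print(msgs[((0, 0), (0, 1))])      # elements sent from process (0, 0) to (0, 1)
print(sum(msgs.values()))          # total elements moved
```

For this 3 x 4 example, three quarters of the cube changes owner, since each process keeps only the range block that matches its own column.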
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• RACE Network Simulator
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• Preliminary Numerical Studies
• Conclusions
RESEARCH OBJECTIVES for SIMULATOR
1. Design and implement a network simulator that models the effect that data mapping and scheduling have on the performance of a STAP algorithm.
2. Key features of the network simulator include:
   a. Developed and implemented in an object-oriented (OO) paradigm.
   b. Implemented using a sub-cube bar partitioning scheme.
   c. Models both sub-cube bar mapping strategies and communication scheduling during both phases of data re-partitioning.
   d. Completely generic.
UML NETWORK CLASS DIAGRAM
[Figure: UML class diagram. Classes: Network, Clock, Crossbar, Routing Table, File Output, Random Scan, Data Cube, Process Set. Association label: "Gets Data From". Multiplicities shown: 1 and 1..*.]
UML CROSSBAR CLASS DIAGRAM
[Figure: UML class diagram. Classes: Crossbar, Link, Compute Node, Message Queue, Packet Stack, Message, Packet. Multiplicities shown: 1; 2; 2,6; 0,4; 0..*.]
UML DATA CLASS DIAGRAM
[Figure: UML class diagram. Classes: Data, Message, Packet, Header, RouteList, Route. Legend: Abstract Class, Inheritance. Multiplicities shown: 1 and 1..*.]
NETWORK CLASS DETAILS
Network Methods
• Dynamic Network Construction
• Dynamic Routing Table Creation
• Dynamic CN and CE Message Traffic Generation
• Simulates Packet Traffic
Crossbar
Link
Compute Node
• Processor Information
• Outgoing and Received Message Queues
• Outgoing and Received Packet Stack
Random Scan
• Generates Pseudo-Random CN Scan Ordering
Clock
• Based on Network Clock Frequency (factor of 5)
• Data Transfer Rate Equates to Effective Network Bandwidth
CROSSBAR CLASS DETAILS
Crossbar
• Two Parent Port Connections
• Four Child Port Connections
• Internal Switch Connections
• Four CN Connections for Terminal Crossbars
Crossbar Methods
• Implements Hardware Priority Arbitration
  - TOP-LEVEL ALGORITHM
  - STANDARD ALGORITHM
• Query Port Status
• Routes Packets to Next Location
• Allocates and Frees Internal Port Connections and Connected Link Objects
• Transmits Packet Data
Link
• Connects Crossbar Objects
• Link Status: Occupied or Free
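With four child ports per crossbar and four CNs on each terminal crossbar, the number of crossbars on a path between two CNs can be estimated by climbing to their nearest common ancestor. The sketch below treats the network as an idealized 4-ary tree with CNs numbered left to right from 0; it ignores the choice between the two parent ports, which the fat tree provides for redundancy. The numbering scheme and function name are assumptions.

```python
def crossbars_on_path(src_cn, dst_cn, children_per_crossbar=4):
    """Count the crossbars a packet passes through between two compute nodes.

    The path climbs from the source's terminal crossbar to the nearest common
    ancestor and descends to the destination's terminal crossbar.
    """
    if src_cn == dst_cn:
        return 0
    levels = 0
    a, b = src_cn, dst_cn
    while a != b:
        # Move both endpoints up one level of the tree.
        a //= children_per_crossbar
        b //= children_per_crossbar
        levels += 1
    # `levels` crossbars going up (including the apex) and `levels - 1` going down.
    return 2 * levels - 1


print(crossbars_on_path(0, 3))    # same terminal crossbar
print(crossbars_on_path(0, 5))    # neighbouring terminal crossbars
print(crossbars_on_path(0, 16))   # one level further apart
```

A simulator would also need to track which ports and links along the path are occupied, as the Link status (Occupied or Free) suggests.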
COMPUTE NODE CLASS DETAILS
Compute Node
• Processor Information
• Outgoing and Received Message Queues
• Outgoing and Received Packet Stack
• PACKETS ARE SELF-ROUTING
Compute Node Methods:
• Manages Outgoing and Received Message Queues
• Manages Outgoing and Received Packet Stack
• Explodes the Top Outgoing Message into Packets of Size 2048 or Less
• Handles DMA Chaining of Packets
• Establishes Path Through Network and Transmits Packet Data
[Figure: the Outgoing Message Queue (Message 1, Message 2, Message 3, ...) feeds the Packet Stack (Packet 1, Packet 2, Packet 3, Packet 4, ...) through an EXPLODE step.]
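The "explode" step, which breaks the top outgoing message into packets of 2048 bytes or less, can be sketched in a few lines. The function name and the choice to return a list of packet sizes are illustrative assumptions; the actual simulator classes are not shown in the slides.

```python
MAX_PACKET_BYTES = 2048   # maximum packet size given in the slides


def explode(message_bytes, max_packet=MAX_PACKET_BYTES):
    """Split a message of `message_bytes` into packet sizes of at most `max_packet`."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    full, rest = divmod(message_bytes, max_packet)
    # All packets are full-sized except possibly the last one.
    return [max_packet] * full + ([rest] if rest else [])


packets = explode(5000)
print(packets)
```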
SIMULATOR UML SEQUENCE DIAGRAM
[Figure: UML sequence diagram. Participants: User (actor), Data Cube, Process Set, Network, Crossbar, CN, Clock. Example inputs: R:200, P:22, C:16; CEs:48; X:6, Y:8; Routing:F; CN Traffic; Phase 1; DMA:Y. Steps shown: Build Messages (message matrices; X, Y, and mapping matrices), Pass 1, Pass 2, Connection/Data Transfer, Increment Simulation Clock, Clean Up. Pass 1 and Pass 2 form an iterative process. Example result: Simulation Time = 2 ms.]
COMPUTE NODE UML STATECHART: Simulation PASS 1
[Figure: UML statechart. Compute Node Subsystem states: Current Packet, Packet Stack Status, Message Queue Status, Explode Top Message, Pop Top Packet. Simulation Subsystem states: Simulate Pass 1, Generate Error Code. Transition labels: Packet Found, No Packet, Empty, Not Empty, Empty - Done, Success, Error.]
COMPUTE NODE UML STATECHART: Simulation PASS 2
[Figure: UML statechart. Compute Node Subsystem state: Current Packet. Simulation Subsystem state: Simulate Pass 2. Transition labels: Packet Found, No Packet.]
PACKET UML STATECHART: Simulation Pass 1 and Pass 2
[Figure: UML statechart for the Simulation Pass Subsystem. Packet states: Start Up, Ready, Active, Blocked, Suspended, Waiting for Kill, Completed. Transitions are labeled by Pass 1 and Pass 2.]
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Preliminary Numerical Studies
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Conclusions
PROCESS SET PERFORMANCE METRIC: Communication Phase 1
[Figure: "Process Set - Phase 1 (CN:8, R:800, P:32, C:22, Routing:E)". Count (0 to 60) versus Time (ms) (7 to 11). Series: CN 8 (6x4), CN 8 (4x6).]
PROCESS SET PERFORMANCE METRIC: Communication Phase 2
[Figure: "Process Set - Phase 2 (CN:8, R:800, P:32, C:22, Routing:E)". Count (0 to 9) versus Time (ms) (28 to 40). Series: CN 8 (6x4), CN 8 (4x6).]
![Page 45: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/45.jpg)
Process Set Performance Metric: Communication Phase 1
[Plot: Process Set - Phase 1 (CN:16, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 0.7 to 1.3; y-axis: Count, 0 to 20; series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)]
![Page 46: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/46.jpg)
Process Set Performance Metric: Communication Phase 2
[Plot: Process Set - Phase 2 (CN:16, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 2.5 to 5.5; y-axis: Count, 0 to 14; series: CN 16 (12x4), CN 16 (8x6), CN 16 (4x12)]
![Page 47: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/47.jpg)
Process Set Performance Metric: Communication Phase 1
[Plot: Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 0.5 to 2; y-axis: Count, 0 to 50; series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9)]
![Page 48: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/48.jpg)
Process Set Performance Metric: Communication Phase 2
[Plot: Process Set - Phase 2 (CN:12, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 3 to 6; y-axis: Count, 0 to 10; series: CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), CN 12 (4x9)]
![Page 49: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/49.jpg)
Process Set Performance Metric: Communication Phase 1
[Plot: Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 0 to 2; y-axis: Count, 0 to 50; series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)]
![Page 50: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/50.jpg)
Process Set Performance Metric: Communication Phase 2
[Plot: Process Set - Phase 2 (CN:12, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 2.5 to 6.5; y-axis: Count, 0 to 14; series: CN 12 (3x12), CN 12 (12x3), CN 12 (4x9)]
![Page 51: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/51.jpg)
Process Set Performance Metric: Communication Phase 1
[Plot: Process Set - Phase 1 (CN, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 0 to 1; y-axis: Count, 0 to 60; series: CN 12 (3x12), CN 16 (4x12)]
![Page 52: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/52.jpg)
Process Set Performance Metric: Communication Phase 2
[Plot: Process Set - Phase 2 (CN, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 2.6 to 3.8; y-axis: Count, 0 to 14; series: CN 12 (3x12), CN 16 (4x12)]
![Page 53: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/53.jpg)
Message Traffic Performance Metric: Communication Phase 1
[Plot: Message Traffic - Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF). x-axis: Time (ms), 2 to 2.5; y-axis: Count, 0 to 9; series: CN Traffic, CE Traffic]
![Page 54: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/54.jpg)
Message Traffic Performance Metric: Communication Phase 2
[Plot: Message Traffic - Phase 2 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF). x-axis: Time (ms), 10 to 25; y-axis: Count, 0 to 8; series: CN Traffic, CE Traffic]
![Page 55: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/55.jpg)
Message Traffic Performance Metric: Communication Phase 1
[Plot: Message Traffic - Phase 1 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF). x-axis: Time (ms), 0.85 to 0.854; y-axis: Count, 0 to 60; series: CN Traffic, CE Traffic]
![Page 56: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/56.jpg)
Message Traffic Performance Metric: Communication Phase 2
[Plot: Message Traffic - Phase 2 (CN:16, X:6, Y:8, R:400, P:22, C:16, Routing:EF). x-axis: Time (ms), 10 to 25; y-axis: Count, 0 to 8; series: CN Traffic, CE Traffic]
![Page 57: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/57.jpg)
Message Traffic Performance Metric: Communication Phase 1
[Plot: Message Traffic - Phase 1 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E). x-axis: Time (ms), 4.95 to 4.952; y-axis: Count, 0 to 60; series: CN Traffic, CE Traffic]
![Page 58: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/58.jpg)
Message Traffic Performance Metric: Communication Phase 2
[Plot: Message Traffic - Phase 2 (CN:12, X:6, Y:6, R:800, P:32, C:22, Routing:E). x-axis: Time (ms), 43 to 51; y-axis: Count, 0 to 10; series: CN Traffic, CE Traffic]
![Page 59: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/59.jpg)
Message Traffic Performance Metric: Communication Phase 1
[Plot: Message Traffic - Phase 1 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E). x-axis: Time (ms), 17 to 23; y-axis: Count, 0 to 7; series: CN Traffic, CE Traffic]
![Page 60: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/60.jpg)
Message Traffic Performance Metric: Communication Phase 2
[Plot: Message Traffic - Phase 2 (CN:12, X:9, Y:4, R:800, P:32, C:22, Routing:E). x-axis: Time (ms), 41 to 49; y-axis: Count, 0 to 7; series: CN Traffic, CE Traffic]
![Page 61: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/61.jpg)
DMA Chaining Performance Metric: Communication Phase 1
[Plot: DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 1.7 to 2.1; y-axis: Count, 0 to 10; series: Chaining, No Chaining]
![Page 62: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/62.jpg)
DMA Chaining Performance Metric: Communication Phase 2
[Plot: DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:200, P:22, C:16, Routing:F). x-axis: Time (ms), 2.5 to 3.5; y-axis: Count, 0 to 9; series: Chaining, No Chaining]
![Page 63: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/63.jpg)
DMA Chaining Performance Metric: Communication Phase 1
[Plot: DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F). x-axis: Time (ms), 3.4 to 4.1; y-axis: Count, 0 to 9; series: Chaining, No Chaining]
![Page 64: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/64.jpg)
DMA Chaining Performance Metric: Communication Phase 2
[Plot: DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:400, P:22, C:16, Routing:F). x-axis: Time (ms), 5.2 to 6.7; y-axis: Count, 0 to 9; series: Chaining, No Chaining]
![Page 65: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/65.jpg)
DMA Chaining Performance Metric: Communication Phase 1
[Plot: DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F). x-axis: Time (ms), 14 to 22; y-axis: Count, 0 to 9; series: Chaining, No Chaining]
![Page 66: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/66.jpg)
DMA Chaining Performance Metric: Communication Phase 2
[Plot: DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F). x-axis: Time (ms), 21 to 27; y-axis: Count, 0 to 8; series: Chaining, No Chaining]
![Page 67: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/67.jpg)
Adaptive Routing Performance Metric: Communication Phase 1
[Plot: Adaptive Routing - Phase 1 (CN:16, X:8, Y:6, R:800, P:32, C:22). x-axis: Time (ms), 7 to 13; y-axis: Count, 0 to 9; series: Adaptive E, Adaptive F, Adaptive E/F]
![Page 68: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/68.jpg)
Adaptive Routing Performance Metric: Communication Phase 2
[Plot: Adaptive Routing - Phase 2 (CN:16, X:8, Y:6, R:800, P:32, C:22). x-axis: Time (ms), 26 to 46; y-axis: Count, 0 to 9; series: Adaptive E, Adaptive F, Adaptive E/F]
![Page 69: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/69.jpg)
Adaptive Routing Performance Metric: Communication Phase 1
[Plot: Adaptive Routing - Phase 1 (CN:16, X:8, Y:6, R:400, P:22, C:16). x-axis: Time (ms), 1.5 to 3.5; y-axis: Count, 0 to 12; series: Adaptive E, Adaptive F, Adaptive E/F]
![Page 70: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/70.jpg)
Adaptive Routing Performance Metric: Communication Phase 2
[Plot: Adaptive Routing - Phase 2 (CN:16, X:8, Y:6, R:400, P:22, C:16). x-axis: Time (ms), 7 to 13; y-axis: Count, 0 to 10; series: Adaptive E, Adaptive F, Adaptive E/F]
![Page 71: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/71.jpg)
Network Communication Time Simulator for Parallel STAP (Recent Accomplishments)
• Space-Time Adaptive Processing (STAP) Basics
• Mercury RACE Multicomputer
• Parallelization Approach for STAP
• RACE Network Simulator
• Preliminary Numerical Studies
• Conclusions
![Page 72: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/72.jpg)
Conclusions
1. Designed and implemented a platform-independent simulator.
2. The simulator demonstrates that the process set, the CN or CE message traffic, DMA chaining, adaptive routing, and the scheduling of messages all affect performance.
3. The simulator allows users to experiment with possible current and future configurations.
4. The communication pattern is implemented for STAP but may be used for other applications with a phased communication pattern.
![Page 73: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/73.jpg)
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
![Page 74: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/74.jpg)
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
![Page 75: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/75.jpg)
References for STAP Weight Solver and FPGA Design
J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, Lexington, MA, 1994.
K. C. Cain, J. A. Torres, and R. T. Williams, (R. A. Games, Project Leader), “RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark,” MITRE Technical Report MTR 96B0000021, Feb. 1997.
MCARM Data Files, Rome Laboratory, (http://sunrise.oc.rl.af.mil).
D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.
WildOne Hardware Reference Manual, Number 11927-0000, Revision 0.1, Annapolis Micro Systems, Inc., MD, 1997.
![Page 76: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/76.jpg)
Typical STAP Processing Flow
[Block diagram. Blocks: Input Data, Pulse Compress, Doppler Filter, Weight Computation, Weight Application, Threshold Detection, Target Decision. Other labels: Covariance Matrix, Steering Vector. Two data cubes are drawn, with axes labeled pulses/range and Doppler/range. Percentage labels on the diagram: 8%, 91.5%, 0.5%]
![Page 77: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/77.jpg)
STAP CPI Data Cube
[Figure: data cube with three axes: PRI, indexed 1 to M (32-128); Channels, indexed 1 to N (24); Range, indexed 1 to L (625-2500)]
![Page 78: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/78.jpg)
Principle Behind STAP
• Range gates are divided into non-overlapping blocks having a fixed number of range gates
• These blocks are referred to as the Range Segments
[Figure: CPI data cube (PRI 1 to M, Channels 1 to N, Range 1 to L) with one Range Segment of Lr range gates marked]
Number of Range Segments = L/Lr
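As a small illustration of the range-segment partition, the sketch below lists the first and last range gate of each segment, using the index range r = (b - 1)Lr + 1, ..., bLr. The function name `range_segments` is hypothetical, and the sketch assumes that L is an exact multiple of Lr, which the slide's L/Lr count suggests but does not state.

```python
def range_segments(L, Lr):
    """Return (first_gate, last_gate) for each range segment, 1-indexed.

    Assumption for this sketch: L is an exact multiple of Lr, so that the
    number of segments is L / Lr as stated on the slide.
    """
    if L % Lr != 0:
        raise ValueError("this sketch assumes L is a multiple of Lr")
    # Segment b covers range gates (b - 1) * Lr + 1 through b * Lr.
    return [((b - 1) * Lr + 1, b * Lr) for b in range(1, L // Lr + 1)]


if __name__ == "__main__":
    for first, last in range_segments(625, 125):
        print(first, last)
```

With L = 625 and Lr = 125 this gives five non-overlapping segments.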
![Page 79: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/79.jpg)
Space-Time Adaptive Processing
• Fully Adaptive STAP
  • Works with data on all M Doppler bins and all N channels
  • Computes and applies a separate adaptive weight to every element and Doppler bin
  • The weight vector is of size MN for each range gate
![Page 80: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/80.jpg)
Space-Time Adaptive Processing
• Characteristics of Fully Adaptive STAP
  • Requires solving a large system of linear equations
  • Size of the linear system grows with
    • Array size (the number of channels)
    • Number of pulses
Example: for each instance, if M = 32 and N = 24, then complexity ≈ (MN)^3 = 768^3 = 452,984,832
• Implementation of fully adaptive STAP is impractical
  • Complexity of each instance is O((MN)^3)
  • With the product MN being several hundred, this is beyond current capabilities in real-time computing
  • Instances of the problem must be solved for each range segment
![Page 81: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/81.jpg)
Space-Time Adaptive Processing
• Partially Adaptive STAP
  • Problem is broken down into a number of smaller, more manageable adaptive problems
  • STAP is applied to these lower-dimension problems
![Page 82: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/82.jpg)
Space-Time Adaptive Processing
• Partially adaptive STAP works with data on
  • All N channels
  • A few adjacent Doppler bins, K in number
• Complexity is reduced to O(M(KN)^3), for K << M
Example: for each instance, if M = 32, N = 24, and K = 3, then complexity ≈ M(KN)^3 = 11,943,936
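The two complexity figures can be compared directly. The sketch below simply evaluates the order-of-magnitude expressions (MN)^3 and M(KN)^3 from the slides; the function names are hypothetical, and the results are operation-count proxies rather than measured flop counts.

```python
def fully_adaptive_ops(M, N):
    """Order-of-magnitude cost of one fully adaptive instance: (MN)^3."""
    return (M * N) ** 3


def partially_adaptive_ops(M, N, K):
    """Order-of-magnitude cost with K adjacent Doppler bins: M * (KN)^3."""
    return M * (K * N) ** 3


if __name__ == "__main__":
    M, N, K = 32, 24, 3
    full = fully_adaptive_ops(M, N)
    partial = partially_adaptive_ops(M, N, K)
    print(full, partial, full / partial)
```

For M = 32, N = 24, and K = 3 the ratio is M^2/K^3 = 1024/27, so the partially adaptive formulation is roughly 38 times cheaper per instance under this estimate.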
![Page 83: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/83.jpg)
Space-Time Adaptive Processing
Kth-Order Doppler Factored STAP
• Effective partially adaptive STAP technique
• The architecture consists of
  • Doppler processing across all pulse repetition intervals
  • Adaptive filtering across
    • all channels and
    • K adjacent Doppler bins
![Page 84: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/84.jpg)
Kth-Order Doppler Factored STAP

$$x(k,r) : 3N \times 1 = \hat{N} \times 1$$

$$\psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r)$$

[Figure: data matrix needed for calculating the covariance matrix for the kth Doppler bin and bth range segment using Kth-order Doppler factored STAP with K = 3. Axes: N channels; Doppler bins (k - 1), k, (k + 1); bth range segment (with Lr cells)]
![Page 85: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/85.jpg)
Matrix-Based Derivation of ψ(k,b)

$$X(k,b) : 3N \times L_r = \hat{N} \times L_r$$

$$\psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) = \frac{1}{L_r} X(k,b)\, X^H(k,b)$$

The Weight Equation:

$$\psi(k,b)\, w(k,b) = s$$
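The equivalence between the sum of outer products and the matrix product can be checked numerically. The following is a minimal pure-Python sketch, not the project's implementation; the function names and the tiny test data are hypothetical. Snapshots are the vectors x(k,r) for one range segment, and X holds the same snapshots as columns.

```python
def covariance_from_snapshots(snapshots):
    """psi = (1/Lr) * sum over r of x_r x_r^H, for a list of Lr snapshot vectors."""
    Lr = len(snapshots)
    n = len(snapshots[0])
    psi = [[0j] * n for _ in range(n)]
    for x in snapshots:
        for i in range(n):
            for j in range(n):
                # Outer product entry: x_i times the conjugate of x_j.
                psi[i][j] += x[i] * x[j].conjugate()
    return [[v / Lr for v in row] for row in psi]


def covariance_from_matrix(X):
    """psi = (1/Lr) * X X^H, where X is n x Lr with snapshots as columns."""
    n, Lr = len(X), len(X[0])
    return [
        [sum(X[i][r] * X[j][r].conjugate() for r in range(Lr)) / Lr for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    snaps = [[1 + 1j, 2j], [0.5 + 0j, 1 - 1j], [3j, 1 + 0j]]  # Lr = 3, n = 2
    X = [[snaps[r][i] for r in range(len(snaps))] for i in range(2)]
    print(covariance_from_snapshots(snaps))
    print(covariance_from_matrix(X))
```

Both routines return the same Hermitian matrix; the matrix form is the one that the QR-based solver works from.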
![Page 86: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/86.jpg)
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
![Page 87: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/87.jpg)
Methods for STAP Weight Calculation
• Two approaches to solve the weight equation
• QR-decomposition method (direct)
• Conjugate Gradient method (iterative)
![Page 88: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/88.jpg)
STAP Weight Calculation
Using QR-decomposition Method to solve the Weight Equation: ψw = s

$$\psi(k,b)\, w(k,b) = s$$

$$\frac{1}{L_r} X(k,b)\, X^H(k,b)\, w(k,b) = s$$

Take QR Decomposition: $X^T(k,b) = QR$

$$\frac{1}{L_r} R^T Q^T Q^* R^*\, w(k,b) = s$$

$$\frac{1}{L_r} R^T R^*\, w(k,b) = s$$

Note that $R^T = [\, R_1^T \;\; 0 \,]$

$$R_1^T R_1^*\, w(k,b) = L_r s$$

Let $R_1^*\, w(k,b) = p$

Solve $R_1^T p = L_r s$ for $p$ using forward elimination

Solve $R_1^*\, w(k,b) = p$ for $w(k,b)$ using backward substitution
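The QR-based steps can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the solver used in the study: the triangular factor R1 is obtained here by modified Gram-Schmidt (the slides do not say which QR algorithm was used), X is assumed to have full row rank, and the function names are hypothetical.

```python
def qr_r_factor(A):
    """Upper-triangular R1 (n x n) of a thin QR of the m x n matrix A.

    Uses modified Gram-Schmidt, chosen only for brevity in this sketch.
    """
    m, n = len(A), len(A[0])
    cols = [[complex(A[i][j]) for i in range(m)] for j in range(n)]
    R = [[0j] * n for _ in range(n)]
    for j in range(n):
        norm = sum(abs(v) ** 2 for v in cols[j]) ** 0.5
        R[j][j] = complex(norm)
        q = [v / norm for v in cols[j]]
        for k in range(j + 1, n):
            R[j][k] = sum(qv.conjugate() * cv for qv, cv in zip(q, cols[k]))
            cols[k] = [cv - R[j][k] * qv for cv, qv in zip(cols[k], q)]
    return R


def solve_weights_qr(X, s):
    """Solve (1/Lr) X X^H w = s via X^T = QR, as in the steps above."""
    n, Lr = len(X), len(X[0])
    Xt = [[X[i][r] for i in range(n)] for r in range(Lr)]  # X^T is Lr x n
    R1 = qr_r_factor(Xt)
    # Forward elimination: R1^T p = Lr * s (R1^T is lower triangular).
    p = [0j] * n
    for i in range(n):
        acc = Lr * s[i] - sum(R1[j][i] * p[j] for j in range(i))
        p[i] = acc / R1[i][i]
    # Backward substitution: conj(R1) w = p.
    w = [0j] * n
    for i in reversed(range(n)):
        acc = p[i] - sum(R1[i][j].conjugate() * w[j] for j in range(i + 1, n))
        w[i] = acc / R1[i][i].conjugate()
    return w


if __name__ == "__main__":
    X = [[1 + 1j, 2 + 0j, 0.5j], [0j, 1 - 1j, 3 + 0j]]  # n = 2, Lr = 3
    print(solve_weights_qr(X, [1 + 0j, 1j]))
```

Note that the covariance matrix itself is never formed; only the triangular factor of the data matrix is needed.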
![Page 89: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/89.jpg)
STAP Weight Calculation
Using Conjugate Gradient Method to solve the Weight Equation: ψw = s

Initialization: choose $w_0$, set $d_0 = -g_0 = s - \psi w_0$

Iteration:

$$w_{i+1} = w_i - \frac{g_i^T d_i}{d_i^T \psi d_i}\, d_i$$

$$g_{i+1} = \psi w_{i+1} - s$$

$$d_{i+1} = -g_{i+1} + \frac{g_{i+1}^T \psi d_i}{d_i^T \psi d_i}\, d_i$$
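The conjugate gradient iteration can be sketched as follows. Two points are assumptions of this sketch rather than details from the slides: the inner products use the conjugate transpose, which is the usual adaptation for a complex Hermitian ψ, and the stopping rule (relative residual norm at or below `tol`) is one plausible reading of the "tolerance" parameter. The function names are hypothetical.

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def hdot(u, v):
    """Hermitian inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def conjugate_gradient(psi, s, tol=1e-10, max_iter=None):
    """Solve psi w = s for Hermitian positive definite psi, starting from w_0 = 0."""
    n = len(s)
    w = [0j] * n
    g = [pw - si for pw, si in zip(matvec(psi, w), s)]  # g_0 = psi w_0 - s
    d = [-gi for gi in g]                               # d_0 = -g_0
    s_norm = abs(hdot(s, s)) ** 0.5
    iterations = 0
    for _ in range(max_iter or 10 * n):
        if abs(hdot(g, g)) ** 0.5 <= tol * s_norm:      # assumed stopping rule
            break
        psi_d = matvec(psi, d)
        denom = hdot(d, psi_d)                          # d^H psi d
        alpha = -hdot(d, g) / denom
        w = [wi + alpha * di for wi, di in zip(w, d)]
        g = [pw - si for pw, si in zip(matvec(psi, w), s)]
        beta = hdot(psi_d, g) / denom                   # keeps d_{i+1} conjugate to d_i
        d = [-gi + beta * di for gi, di in zip(g, d)]
        iterations += 1
    return w, iterations


if __name__ == "__main__":
    psi = [[2 + 0j, 1j], [-1j, 3 + 0j]]
    print(conjugate_gradient(psi, [1 + 0j, 1 + 1j]))
```

Each iteration is dominated by one matrix-vector product and a handful of inner products, which is why the inner product is a natural candidate for offloading.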
![Page 90: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/90.jpg)
Preliminary Numerical Studies
[Figure: two log-log plots of Relative Error versus Tolerance (10^-8 to 10^-1). Panel Lr = 125: Relative Error axis 10^-9 to 10^0. Panel Lr = 250: Relative Error axis 10^-7 to 10^-1]

$$\text{Relative Error} = \frac{\lVert w_{qr} - w_{cg} \rVert}{\lVert w_{qr} \rVert}$$
![Page 91: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/91.jpg)
Preliminary Numerical Studies
[Figure: two plots of Flop Count (10^8 to 10^10) versus Tolerance (10^-8 to 10^-1), one panel for Lr = 125 and one for Lr = 250; series: CG, QR]
![Page 92: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/92.jpg)
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
![Page 93: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/93.jpg)
Motivation for FPGA Inner-Product Co-Processors
• Inner-products are a core calculation for both CG- and QR-based STAP weight solvers
• Computations are highly numeric and regular
• Opportunities to exploit reduced precision arithmetic
• Control flow of CG and QR is best implemented on a GPP or DSP; inner-product calculations can be offloaded to available FPGA resources
![Page 94: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/94.jpg)
Overview of WildOne Architecture
[Block diagram. Components: PCI Bus, FIFO 0, FIFO 1, Processing Element 0, Processing Element 1, Dual Port Mem Controller 0, Dual Port Mem Controller 1, SIMD Connector, External I/O Connector]
![Page 95: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/95.jpg)
FPGA Inner Product Co-Processor: Design 1
• Multiply-Accumulate Pipe
• Reads two operands per cycle
• Performs two operations per cycle
• Performs exponent normalization prior to accumulation
• 2 N-vectors reduced to a constant number of partial sums
[Block diagram: host processor, interconnection bus, and FPGA board; on the board, two input buffers (a, b) feed a multiplier (X), followed by a 1's comp/register stage driven by the sign of a·b, a normalizing unit, an adder (+), and an output register; operands are sign + 16-bit mantissa]
![Page 96: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/96.jpg)
FPGA Inner Product Co-Processor: Design 2
• Multiply-Add Reduction Pipe
• Reads four operands per cycle
• Performs three operations per cycle
• No normalization required
• 2 N-vectors reduced to N/2 partial sums
• Basic Tradeoff: First design has lower throughput, but can perform more work
[Block diagram: host processor, interconnection bus, and FPGA board; on the board, two buffers supply data for the first and second multipliers (X, X), followed by a 1's comp/register stage (Sign a, Sign b) and an adder (+); operands are sign + 16-bit mantissa; annotations "2 ff" and "Unit clocked here"]
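The difference between the two reduction schemes can be sketched in software. This is only an illustration of the data flow described on the two design slides, not the circuits themselves: the function names and the `lanes` parameter (standing in for the "constant number of partial sums" of Design 1) are assumptions, and ordinary Python floats replace the block floating point arithmetic.

```python
def design1_partial_sums(a, b, lanes=2):
    """Multiply-accumulate: one product per step is folded into one of a
    fixed number of running sums (`lanes` is a hypothetical count)."""
    sums = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        sums[i % lanes] += x * y
    return sums


def design2_partial_sums(a, b):
    """Multiply-add: two products per step are added once, leaving one
    partial sum per pair of elements (N/2 in total)."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("vectors must have equal, even length")
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.0] * 8
p1 = design1_partial_sums(a, b)  # length stays fixed as N grows
p2 = design2_partial_sums(a, b)  # length grows as N/2
```

In both cases the host still has to add up the partial sums; Design 2 simply leaves more of that work to the host.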
![Page 97: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/97.jpg)
UML Description of Basic Co-Processor Design
[UML diagram: Inner-Product Co-Processor associated one-to-one with a Block Floating Point Unit]
![Page 98: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/98.jpg)
UML Description of Block Floating Point Unit
[UML diagram: Block Floating Point Unit composed of a Multiplying Unit, a Complementor, a Normalizing Unit, and an Accumulator, each with multiplicity 1]
![Page 99: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/99.jpg)
UML Description of Multiplying Unit
[UML diagram: Multiplying Unit composed of Register, 4-Bit Adder, and Multiply Stage components; the multiplicity labels could not be matched to associations]
![Page 100: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/100.jpg)
UML Description of Accumulator Unit
[UML diagram: Accumulator composed of 4-Bit Adder, Register, and 3-Bit Adder components; the multiplicity labels could not be matched to associations]
![Page 101: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/101.jpg)
UML Description of Normalizing Unit
[UML diagram: Normalizing Unit composed of Subtractor, Register, and Magnitude Comparator components]
![Page 102: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/102.jpg)
Sequence Diagram for Interactions between Host and FPGA Board
Messages between the Host Program and the Wild-One board, in time order:
• Open board
• Program the board with the image
• Interrupt for Exponent
• Exponent written to FIFO
• Interrupt for Mantissa Vectors
• Mantissa Vectors written to the FIFO
• Processing Done, answer in FIFO/Memory
• Read back the answer
• Close the board
![Page 103: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/103.jpg)
Statechart Diagram for Interactions between Host and FPGA Board
[Statechart with two regions, Host System and FPGA Board. States shown: Get Exponent, Write Exponent, Get Mantissa, Write Mantissa, Wait for Answer Int, Read Back Answer, Wait for Exponent Int, Read Exponent, Wait for Mantissa Int, Read Mantissa, Multiply-and-add/accumulate (Processing Sub-System), Write Back. Transitions are labeled Int Req and Int Ack and are guarded by the Req, Ack, and Done flags (Req = 1, Ack = 1, Req = 0, Ack = 0, Done = 1)]
![Page 104: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/104.jpg)
Circuit Activity Diagram: Design 1
[Activity diagram: Read Two Operands; Multiply; Accumulate; Feedback Sum; Increment Count; Compare Count. Guards: [Count ≠ Threshold] and [Count = Threshold]. Remaining activities: Write to Memory; Set Done flag]
![Page 105: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/105.jpg)
Circuit Activity Diagram: Design 2
[Activity diagram: Read Two Operands; Multiply; Read Next Two Operands; Multiply; Add; Write to Memory; Increment Count; Compare Count. Guards: [Count ≠ Threshold] and [Count = Threshold]. Remaining activity: Set Done flag]
![Page 106: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/106.jpg)
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Recent Accomplishments)
• Overview of STAP Weight Calculation
• Two Candidate STAP Weight Solvers: QR Versus CG
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
![Page 107: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/107.jpg)
Setup for Numerical Accuracy Studies
• Randomly generated, 512-element test vectors processed by both designs
• Range of vectors’ data values controlled to study effect dynamic range has on accuracy
• Output of each circuit compared to corresponding results calculated on host (using IEEE 32-bit floating point arithmetic)
• Accuracy metric is ratio of obtained values to corresponding IEEE floating point value
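The accuracy metric above can be illustrated with a small software model. This is a sketch under stated assumptions, not the circuits that were measured: it assumes one shared exponent per vector, truncation to a sign plus 16-bit magnitude, and an exact integer accumulator; the function names are hypothetical, and the 32-bit reference is emulated by rounding each intermediate result to IEEE single precision.

```python
import math
import random
import struct

MANTISSA_BITS = 16  # magnitude bits; the sign is carried separately


def to_float32(x):
    """Round a Python float to the nearest IEEE 32-bit value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def block_quantize(vec, bits=MANTISSA_BITS):
    """Return (integer mantissas, shared exponent) so v ~ m * 2**exponent."""
    peak = max(abs(v) for v in vec)
    if peak == 0.0:
        return [0] * len(vec), 0
    _, e = math.frexp(peak)            # peak = m * 2**e with 0.5 <= m < 1
    scale = 2.0 ** (bits - e)
    return [int(v * scale) for v in vec], e - bits  # int() truncates


def block_inner_product(a, b):
    ma, ea = block_quantize(a)
    mb, eb = block_quantize(b)
    acc = sum(x * y for x, y in zip(ma, mb))  # exact integer accumulation
    return math.ldexp(acc, ea + eb)


def float32_inner_product(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_float32(to_float32(x) * to_float32(y))
        acc = to_float32(acc + prod)
    return acc


def accuracy_ratio(n=512, low=0.001, high=1.0, seed=0):
    """Ratio of the block floating point result to the 32-bit reference."""
    rng = random.Random(seed)
    a = [rng.uniform(low, high) for _ in range(n)]
    b = [rng.uniform(low, high) for _ in range(n)]
    return block_inner_product(a, b) / float32_inner_product(a, b)


ratio_uniform = accuracy_ratio()
```

Widening the `low`/`high` range, or planting a single very large element in one vector, forces a larger shared exponent and so coarser quantization of the small elements.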
![Page 108: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/108.jpg)
Zero Order of Magnitude Experiment
[Figure: four histograms, frequency on the vertical axis. Data Histogram (values 0.001 to 1.000); Exponent Histogram (exponents 114 to 134); Accuracy Histogram, Design 1 (accuracy 0.999855 to 0.99989); Accuracy Histogram, Design 2 (accuracy 0.9984 to 1.0000)]
![Page 109: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/109.jpg)
Two Orders of Magnitude Experiment
[Figure: four histograms, frequency on the vertical axis. Data Histogram (values 0 to 110); Exponent Histogram (exponents 119 to 145); Accuracy Histogram, Design 1 (accuracy 0.999893 to 0.999927); Accuracy Histogram, Design 2 (accuracy 0.99399 to 0.99999)]
![Page 110: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/110.jpg)
Four Orders of Magnitude Experiment
[Figure: four histograms, frequency on the vertical axis. Data Value Histogram (values 0 to 10985); Exponent Histogram (exponents 119 to 145); Accuracy Histogram, Design 1 (accuracy 0.999889 to 0.99993); Accuracy Histogram, Design 2 (accuracy 0.467 to 1.000)]
![Page 111: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/111.jpg)
Five Orders of Magnitude Experiment
[Figure: four histograms, frequency on the vertical axis. Data Value Histogram (values 0 to 103008); Exponent Histogram (exponents 119 to 143); Accuracy Histogram, Design 1 (accuracy 0.999912 to 0.999998); Accuracy Histogram, Design 2 (accuracy 0.00000 to 0.99998)]
![Page 112: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/112.jpg)
“Outlier” Experiment
[Figure: four histograms, frequency on the vertical axis. Data Value Histogram (values 0.0009 to 1000.0000, with the outlier marked); Exponent Histogram (exponents 114 to 138); Accuracy Histogram, Design 1 (accuracy 0.593067 to 0.78369); Accuracy Histogram, Design 2 (accuracy 0.00 to 0.92)]
![Page 113: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/113.jpg)
Conclusions
• CG weight solver provides tradeoff between accuracy and required FLOPs (compared to QR weight solver)
• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2
• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges
• Block floating point accuracy breaks down when there are a few large outliers in the data set
![Page 114: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/114.jpg)
Recent Accomplishments
• Network Communication Time Simulator for Parallel STAP
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solver
• Power Prediction Simulator for the Xilinx 4000-Series FPGA
![Page 115: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/115.jpg)
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
![Page 116: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/116.jpg)
References for FPGA Power Prediction
K. P. Parker and E. J. McCluskey, “Probabilistic Treatment of General Combinational Networks,” IEEE Trans. Computers, Vol. C-24, June 1975, pp. 668-670.
Kaushik Roy and Sharat Prasad, “Circuit Activity Based Logic Synthesis for Low Power Reliable Operations,” IEEE Trans. VLSI Systems, Vol. 1, No. 4, Dec. 1993, pp.
Kaushik Roy, “Power Dissipation Driven FPGA Place and Route under Timing Constraints,” School of Electrical and Computer Engineering, Purdue University.
“XC4000 Series Field Programmable Gate Arrays,” Xilinx, Inc., September 18, 1996.
![Page 117: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/117.jpg)
Power Dissipation in CMOS
• Leakage Current
• Transient Current
• Dynamic Capacitance Charging Current
Diagram annotations: most important for CMOS; dependent on clock frequency; dependent on signal activity
![Page 118: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/118.jpg)
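Power Equations
Equivalent model of a transistor’s gate...
$$v_c(t) = V\left(1 - e^{-t/RC}\right)$$
$$v_R(t) = V e^{-t/RC}$$
$$p_R(t) = \frac{V^2 e^{-2t/RC}}{R}$$
$$p_{avg} = \frac{1}{\tau}\int_0^{\tau} \frac{V^2 e^{-2t/RC}}{R}\,dt = \frac{CV^2}{2\tau}\int_0^{\tau} \frac{2}{RC}\,e^{-2t/RC}\,dt$$
$$p_{avg} = \frac{CV^2}{2\tau}\left[-e^{-2t/RC}\right]_0^{\tau} \approx \frac{CV^2}{2\tau}$$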
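The approximation on this slide, that the average power over a charging interval τ is close to CV²/(2τ), can be checked numerically. The sketch below evaluates the closed form of the integral, (CV²/(2τ))(1 − e^{−2τ/RC}), and compares it with the approximation; the component values are hypothetical and chosen only so that τ is much longer than RC.

```python
import math


def average_charging_power(v, r, c, tau):
    """Average power dissipated in R while charging C over [0, tau]."""
    return c * v ** 2 / (2.0 * tau) * (1.0 - math.exp(-2.0 * tau / (r * c)))


def approx_average_power(v, c, tau):
    """The slide's approximation, valid when tau is much larger than RC."""
    return c * v ** 2 / (2.0 * tau)


# Hypothetical values: 5 V supply, 1 kOhm, 1 pF, 100 ns interval.
V, R, C, TAU = 5.0, 1.0e3, 1.0e-12, 1.0e-7
exact = average_charging_power(V, R, C, TAU)
approx = approx_average_power(V, C, TAU)
```

When τ is comparable to RC the exponential term is no longer negligible and the exact value falls below the approximation.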
![Page 119: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/119.jpg)
Probabilistic Modeling
p(s): the probability that signal s attains a logical value of true at any given clock cycle.
A(s): the probability that signal s transitions at any given clock cycle.
Example waveform values:
• clock: p(clock) = 0.50, A(clock) = 1.0
• x1: p(x1) = 0.88, A(x1) = 0.10
• x2: p(x2) = 0.29, A(x2) = 0.17
• x3: p(x3) = 0.69, A(x3) = 0.27
![Page 120: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/120.jpg)
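Probabilistic Modeling
[Figure: logic circuit with inputs x1, x2, x3 and output y, with a waveform for each signal]
• x1(t): p=0.88, A=0.10
• x2(t): p=0.29, A=0.17
• x3(t): p=0.69, A=0.27
• x1x2(t): p=0.83, A=0.17
• x1x2x3(t): p=0.10, A=0.13
Calculation of average power:
$$P_{avg} = \frac{1}{2} \sum_{g \,\in\, \text{all gates}} C_g V^2 A_g$$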
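The average-power sum, one half of V² times the sum over all gates of C_g A_g, is simple to evaluate once each gate has a capacitance and an activity. The sketch below uses the five activity values shown on this slide; the 10 fF load per gate and the 5 V supply are hypothetical. Because A is defined per clock cycle, the result of this sketch is best read as energy per cycle; multiplying by a clock frequency to obtain watts is an assumption of the example, not something stated on the slide.

```python
def average_power(gates, vdd):
    """Evaluate 0.5 * V^2 * sum(C_g * A_g).

    gates: iterable of (capacitance_in_farads, activity) pairs.
    """
    return 0.5 * vdd ** 2 * sum(c * a for c, a in gates)


# Activities from the slide; 10 fF per gate is a made-up load.
ACTIVITIES = [0.10, 0.17, 0.27, 0.17, 0.13]
gates = [(10e-15, a) for a in ACTIVITIES]

per_cycle = average_power(gates, vdd=5.0)
watts_at_20mhz = per_cycle * 20e6  # assumed clock frequency
```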
![Page 121: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/121.jpg)
Probabilistic Equations
Signal probability transformations...*
[Equation: p(y) for a function f, written in terms of the product terms π_i, i = 1, ..., k, of f]
Signal activity transformations...†
[Equation: A(y) written as a sum over input states X of P(X) multiplied by terms for one, two, and three inputs switching together; each term combines f(X) ⊕ f(X; z_i, ...) with probabilities involving z_i and x_i and products over the remaining inputs]
* Probabilistic Treatment of General Combinational Networks
† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching
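One way to make the signal probability transformation concrete is to sum, over every input assignment for which the function is true, the probability of that assignment. The sketch below does this by brute-force enumeration and assumes the inputs are statistically independent; that assumption, the function name, and the example input probabilities are all choices of this illustration rather than details of the simulator.

```python
from itertools import product


def signal_probability(f, input_probs):
    """P(f = 1) for independent inputs.

    f: Boolean function taking one positional argument per input.
    input_probs: dict mapping input name to P(input = 1), in argument order.
    """
    names = list(input_probs)
    total = 0.0
    for values in product((False, True), repeat=len(names)):
        if f(*values):
            weight = 1.0
            for name, value in zip(names, values):
                p = input_probs[name]
                weight *= p if value else 1.0 - p
            total += weight
    return total


# Hypothetical inputs, each true half of the time.
p_or = signal_probability(lambda u, v: u or v, {"u": 0.5, "v": 0.5})
p_and = signal_probability(lambda u, v: u and v, {"u": 0.5, "v": 0.5})
```

Enumeration costs 2^n evaluations for n inputs, which is practical for small look-up tables but not for wide functions.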
![Page 122: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/122.jpg)
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
![Page 123: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/123.jpg)
FPGA Design
FPGA internal structure design...
[Diagram: FPGA internal structure showing CLB, IOB, and BUF elements]
![Page 124: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/124.jpg)
Routing Fabric Design
Example routings...
Xilinx 4000 series routing fabric is very intricate.
Xilinx synthesis tools use shortest path routing where possible.
The distance the signal travels is the metric considered in this model.
![Page 125: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/125.jpg)
Signal Design
Signal attributes: Symbolic Probability, Numeric Probability, Numeric Activity, Signal Reference, Manhattan Distance
[Diagram: two CLBs with a Local Signal (L) and a Remote Signal (R)]
![Page 126: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/126.jpg)
Iteration Example
[Diagram: six LUTs connected through an interconnection block; two connections are labeled 4]
![Page 127: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/127.jpg)
Iteration Example
[Diagram: the six LUTs from the previous slide with each signal marked as local (L) or remote (R)]
![Page 128: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/128.jpg)
Probabilistic Feedback Example
• Feedback Circuits Require Symbolic Iteration of Probability Expressions
• Assume pa, pb, pe are known; then pd and pc are determined using iteration
[Circuit: d = a + bc and c = d e, with c fed back as an input to the gate producing d]
Iteration 1:
pd = pa
pc = pa pe
Iteration 2:
pd = pa + pa pb pe
pc = (pa pe + pa pb pe) pe = pa pe
Iteration 3:
pd = pa + pa pb pe
pc = pa pe
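The iteration for this feedback pair, d = a + bc and c = d e, can be sketched as a fixed-point computation. In this illustration the "symbolic expression" of each signal is stored as the set of input assignments (a, b, e) for which it is true, so Boolean simplification such as a + abe = a happens automatically; this representation, the starting point c = 0, and the independence of a, b, and e are assumptions of the sketch rather than a description of the simulator's internals.

```python
from itertools import product

INPUTS = ("a", "b", "e")
ROWS = list(product((False, True), repeat=len(INPUTS)))  # all assignments


def iterate_feedback(max_iters=10):
    """Iterate d = a + b*c, c = d*e from c = 0 until nothing changes."""
    d = frozenset()
    c = frozenset()
    for n in range(1, max_iters + 1):
        new_d = frozenset(r for r in ROWS if r[0] or (r[1] and r in c))
        new_c = frozenset(r for r in new_d if r[2])
        if new_d == d and new_c == c:
            return d, c, n
        d, c = new_d, new_c
    raise RuntimeError("no fixed point within max_iters")


def probability(rows, probs):
    """Sum the probabilities of the assignments where the signal is true."""
    total = 0.0
    for row in rows:
        weight = 1.0
        for name, value in zip(INPUTS, row):
            weight *= probs[name] if value else 1.0 - probs[name]
        total += weight
    return total


probs = {"a": 0.5, "b": 0.25, "e": 0.5}  # hypothetical input probabilities
d_rows, c_rows, iterations = iterate_feedback()
p_d = probability(d_rows, probs)
p_c = probability(c_rows, probs)
```

For these inputs the fixed point gives p_c equal to the product of the probabilities of a and e, in line with the converged expression pc = pa pe.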
![Page 129: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/129.jpg)
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
![Page 130: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/130.jpg)
Experimental Results
Probabilistic signals are correctly propagated through combinational and sequential logic.
Configurations making use of feedback converge for all test cases.
Probabilistic modeling is more than an order of magnitude faster than time-domain modeling techniques.
![Page 131: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/131.jpg)
Convergence of Probabilistic Signals
[Figure: Probability Convergence; % convergence (0 to 120) versus iterations (0 to 10) for the test cases Adder4, FIFO, PipeAdder, and Mult32]
All test cases converged in the following manner:
• Steep Slope: Signals not involved with feedback rapidly propagated through the FPGA.
• Plateau: Signals dependent on feedback converge slowly.
![Page 132: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/132.jpg)
Symbolic Term Explosion
Mixing 12 signals this way...
…gives 6 signals with at most 4 terms.
Mixing 12 signals this way...
…gives 1 signal with at most 4096 terms.
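The two counts on this slide match the bound 2^k for a signal that depends on k independent inputs: 2^2 = 4 when the 12 signals are combined in pairs, and 2^12 = 4096 when all 12 feed one signal. The sketch below reproduces that arithmetic; reading "terms" as one term per input combination is an interpretation of the slide, and the function name is hypothetical.

```python
def max_terms(num_inputs):
    """Upper bound on symbolic terms for a signal with num_inputs inputs."""
    return 2 ** num_inputs


NUM_SIGNALS = 12

# Combine the 12 signals two at a time: six outputs, each a small expression.
pairwise = [max_terms(2) for _ in range(NUM_SIGNALS // 2)]

# Combine all 12 signals into a single output.
all_together = max_terms(NUM_SIGNALS)
```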
![Page 133: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/133.jpg)
Power Measurements
• Heat Measurements
• Developed hardware instrumentation to measure surface temperature of FPGA
• Thermistor attached to FPGA with heat conductive epoxy
• Instrumentation accurate to within 0.1 degrees F
![Page 134: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/134.jpg)
Frequency Response of the FPGA
• The FPGA consumes more power as its clock frequency rises.
• The simulator gives 125 mW + 43.6 mW/MHz for this situation.
[Figure: Surface Temperature versus Frequency; temperature (F, 120 to 180) against frequency (MHz, 0 to 50)]
![Page 135: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/135.jpg)
Power Prediction Simulator for the Xilinx 4000-Series FPGA
(Recent Accomplishments)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Preliminary Experimental Results
• Conclusions and Current Work, Demo
![Page 136: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/136.jpg)
Conclusions and Current Work
• Designed and Implemented power prediction simulator for Xilinx 4000 series FPGAs.
• Inputs to simulator:
– Place & Route bit stream (from Xilinx Tool)
– Activity and Probability factors for pin signals
• Simulator calculates probabilities and activities for all internal signals
• Tool outputs power consumption of FPGA chip
• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods
![Page 137: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/137.jpg)
Outline
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Recent Accomplishments
• Status of Deliverable Checklist
![Page 138: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/138.jpg)
Deliverables
• Prototype VME-Based GPP/DSP/FPGA platform
– 20 Slot Chassis with SPARC 5V Host
– 9U VME RACE Board
– 2 SHARC Daughtercards: 12 SHARCs, 48MB
– 2 PowerPC Daughtercards: 4 PowerPCs, 64MB
– VME WILDFIRE Array Card (16 Xilinx 4028EX-3s)
√
√
√
![Page 139: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/139.jpg)
Deliverables
• FPGA Power Prediction Simulator
– Simulator Input: Probabilistic Input Data Characteristics; FPGA configuration data file
– Simulator Output: Power Prediction to within 10% relative accuracy (expected)
– Will demonstrate fidelity across different applications and even different implementations of the same design
– Will operate at interactive speeds
– Completely Portable Java Implementation
√
√
√
√
√
![Page 140: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/140.jpg)
Deliverables
• Network Simulator for Parallel STAP
– Network Feature Inputs: number and types of switching elements; interconnection scheme; number and types of processors at each network port, etc.
– Data Mapping Input: Data layout across the processors for each phase of processing
– Data Ordering Input: Order in which data items at each network port are to be transmitted
– Simulator Output: Number of network cycles required for all phases of STAP communication
– Relative accuracy of simulator 10% (expected)
– Will operate at interactive speeds
– Completely Portable Java Implementation
√
√
√
√
√
√
√
![Page 141: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/141.jpg)
Deliverables
• Linear Filtering Implementation on FPGA
– Investigation of different data formats and arithmetic approaches for FPGA calculations
– Demonstrate performance improvement (throughput and/or power) over GPP/DSP implementation
• STAP Weight Equation Solver on GPP/DSP/FPGA System
– Investigation of different data formats and arithmetic approaches for FPGA calculations
– Demonstrate performance improvement (throughput and/or power) over GPP/DSP implementation
√
√
√
![Page 142: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/142.jpg)
Deliverables
• Optimal configuration techniques for executing SAR on GPP/DSP/FPGA system
– Based on optimally balancing memory and processor utilization, selection of most appropriate data formats and arithmetic techniques, etc.
– Will utilize the FPGA power prediction simulator
– Will optimally integrate most appropriate FPGA circuit implementations and GPP/DSP algorithms
– Optimization techniques based on proven mathematical programming methods
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
√
√
√
![Page 143: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/143.jpg)
Deliverables
• Optimal configuration techniques for executing STAP on GPP/DSP/FPGA system
– Techniques based on optimal data layout to minimize latency through the interconnection network and on optimal combined use of processors and FPGAs for intensive weight calculation; will include desired numerical accuracy as an input parameter
– Will utilize the FPGA power prediction simulator and the network simulator for parallel STAP
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods
√
√
![Page 144: Optimal Configuration of Combined GPP/DSP/FPGA Systems for ...antonio/pubs/p-ann_rev98acs.pdf · Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor](https://reader030.fdocuments.us/reader030/viewer/2022040910/5e836580d0280259fb707bf1/html5/thumbnails/144.jpg)
Deliverables
• Optimal configuration techniques for SAR and STAP on GPP/DSP/FPGA system
– Will generalize the SAR-only and STAP-only configuration techniques
– Will consider how to best configure the GPP/DSP/FPGA to simultaneously satisfy both the SAR and STAP requirements and minimize power consumption
– Will demonstrate 2 to 10 times power savings over nominal configurations of GPP/DSP systems
– Optimization techniques based on proven mathematical programming methods