Slide 1 — 12/10/04
Virtual Prototyping of Advanced Space System Architectures Based on RapidIO: Phase II Report
Sponsor:
Honeywell Space Systems, Clearwater, FL
Principal Investigator:
Dr. Alan D. George
Funded Graduate Assistants:
David Bueno, Ian Troxel, Chris Conger
Additional Graduate Assistants:
Adam Leko
HCS Research Laboratory, ECE Department
University of Florida
Slide 2

Presentation Outline
- Project Motivation and Goals
- Project Tasks Overview
- SAR Algorithm Flow
- RapidIO Logical I/O Layer Models
- MLD/RapidIO/Altia GUI
- SAR System Designs
- Previous Work Summary
- Experiments and Results
- GUI Demo
- Conclusions
- Future Work
- Collaboration Possibilities
Slide 3

Project Motivation and Goals
- Simulative analysis of Space-Based Radar (SBR) systems using RapidIO interconnection networks
  - A high-performance, switched interconnect for embedded systems
  - Good scalability, with better bisection bandwidth than bus-based designs
- Build upon work from previous semesters
  - RapidIO simulation model construction set
  - GMTI partitioning, system design, and performance evaluation
  - RapidIO switch and routing issues investigation
- Study optimal methods of constructing scalable RIO-based systems for Synthetic Aperture Radar (SAR)
  - Identify system-level tradeoffs in system designs
  - Discrete-event simulation of the RapidIO network, processing elements, and SAR algorithm
  - Identify limitations of RIO design for SBR
  - Determine effectiveness of various SAR algorithm partitionings over a RIO network

Image courtesy http://www.afa.org/magazine/aug2002/0802radar.asp
Slide 4

Project Tasks Overview
- RIO and SAR modeling
  - Updated GMTI results
  - Developed RIO Logical I/O layer model
  - Developed SAR-specific models
  - Created standard and double-buffered versions of SAR
  - Performed numerous experiments
- Demonstration GUI
  - Altia tool acquired and linked to MLD
  - Generic RIO demo interface created
  - SAR-RIO demo interface created
- RapidIO testbed
  - RIO physical and logical layer cores acquired from Xilinx
  - Two Virtex-II Pro development boards
  - Acquiring test equipment
  - Examining other RIO core options, including GDA
Slide 5

RapidIO Logical I/O Layer Model
- RapidIO three-layer architecture: logical (end-to-end), transport, physical
- Previously developed RIO models use the RapidIO Message Passing logical layer
- New Logical I/O layer model provides memory-mapped reads and writes
  - Well suited to our global-memory-based SAR approach
  - Provides potentially increased performance through responseless writes
  - Allows the GM board to function without algorithm knowledge: with the MP logical layer the GM board must "send" data to processors, while with I/O the processors just "read" memory
- Easy "plug-and-play" compatibility with existing RIO physical layer models

Figure: MLD Logical I/O Model
Slide 6

Merged Logical Layers
- Packets coming into the logical layer from the physical layer or application layers are routed to the appropriate logical layer component (Logical I/O or Message Passing)

Figure: MLD Merged Logical Layers Model (callouts: "Select appropriate logical layer block", "Logical layer blocks")
Slide 7

Altia GUI Interface to MLD
- Ported the Altia graphical interface tool to MLD
  - Altia is designed to integrate with arbitrary C-based applications
  - Created an MLD library of Altia components for fast integration
  - Use Altia to create custom, useful GUIs to control and monitor simulations
- A total of three components are used to interface with Altia
  - Initialization module: launches the GUI and registers connections
  - Input module: receives data from Altia and passes it to the MLD simulation
  - Output module: sends data from the MLD simulation to Altia

Figure: Altia-MLD interface components
Slide 8

Altia GUI Interface to MLD (continued)
- Designed GUI demo systems to illustrate the potential
- Two systems constructed, designed using our RapidIO construction set
  - SBR-demo system provides real-time performance visualization for the existing SAR and GMTI simulations
  - Input-demo system allows the user to control system behavior through GUI controls and observe the system's reaction in real time

Figures: Altia interface module as seen in MLD; SBR-demo GUI to visualize simulations
Slide 9

SAR Algorithm Flow
- SAR is composed of 7 sub-tasks
- 2-dimensional data set (image), processed iteratively
  - Due to the large image size, compute nodes must process portions of the total data set, looping until the entire image is processed for each sub-task
  - Processed out of global memory in a cyclic read-compute-write pattern
  - Each sub-task's optimal data partitioning varies
- Data size stays constant throughout the algorithm**
  - As opposed to the monotonically decreasing data size of the GMTI algorithm
- Extensive data gathering and image processing time
  - 16-second Coherent Processing Interval (CPI)
  - Data image potentially as large as 8 GB

Figure: sub-task flow diagram with blocks for Range FFT, Range-Pulse Compression, Pulse FFT, Polar Reformatting, Pulse FFT, Auto-focus, and Magnitude Function; the image is divided into range/pulse blocks along the range and pulse dimensions

** The final sub-task reduces the data size by 1/2; otherwise data size remains constant until the final step
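The cyclic read-compute-write pattern above can be sketched as a short loop. This is an illustrative schematic only; the function and its argument names are ours, not part of the MLD models:

```python
# Sketch of SAR's per-sub-task processing loop: each compute node
# repeatedly reads a chunk from global memory (GM), processes it,
# and writes the result back, until the whole image is consumed.
# All names here are illustrative, not from the actual MLD models.

def process_sub_task(image_bytes, chunk_bytes, read, compute, write):
    """Cyclic read-compute-write over an image held in global memory."""
    n_chunks = image_bytes // chunk_bytes
    for i in range(n_chunks):
        offset = i * chunk_bytes
        data = read(offset, chunk_bytes)   # RapidIO read from GM
        result = compute(data)             # local processing
        write(offset, result)              # RapidIO write back to GM
    return n_chunks
```

For example, assuming 8-byte complex samples, a 2048 x 2048 image split into 128 KB chunks makes this loop run 256 iterations per sub-task.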
Slide 10

Partitioning Methods
- Chose straightforward partitioning for SAR due to latency considerations
  - Each chunk is split across all processors
  - Pipelined or staggered methods would incur extremely high latencies for full-size images due to the 16 s CPI
    - Pipelined latency ≈ number of stages × CPI
    - Staggered latency ≈ number of groups × CPI
    - Straightforward latency ≈ CPI
- Other possibly acceptable partitionings
  - Staggered-by-chunk: split each chunk across each 4-processor board instead of across all nodes
    - Could possibly increase efficiency without a latency penalty
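The rough latency relations above can be put in numerical form with a back-of-the-envelope helper (ours, not part of the simulation models):

```python
# Back-of-the-envelope CPI latency estimates for the three SAR
# partitioning methods discussed above (illustrative helper only).

CPI = 16.0  # Coherent Processing Interval, seconds

def straightforward_latency(cpi=CPI):
    # Every processor works on each chunk; result ready after ~1 CPI.
    return cpi

def pipelined_latency(n_stages, cpi=CPI):
    # Each stage handles one sub-task; first result after ~stages * CPI.
    return n_stages * cpi

def staggered_latency(n_groups, cpi=CPI):
    # Processor groups take turns on whole CPIs of data.
    return n_groups * cpi
```

With, e.g., a 7-stage pipeline the first full-size result would arrive roughly 112 s after data collection starts, versus ~16 s for the straightforward method, which is why straightforward partitioning was chosen.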
Slide 11

SAR Backplane and System Models
- High bandwidth requirements of the GMTI algorithm dictate the architecture for SAR systems
  - All systems must eventually support both SAR and GMTI
- The same backplane efficiently supports four-, five-, six-, and seven-board configurations
  - Smaller switches can be used on the backplane to conserve power if fewer than seven boards are needed
  - A three-board system is possible using a similar configuration with only two backplane switches

Figures: 7-Board System; 4-Switch Non-blocking Backplane; Backplane-to-Board 0, 1, 2, 3 connections; Backplane-to-Board 4, 5, 6, and Data Source/GM connections
Slide 12

Previous Work Summary
- Studied RapidIO system designs for space-based GMTI
- Important conclusions:
  - A non-blocking backplane is extremely important for GMTI
  - GMTI is not sensitive to the latency of individual packets; cut-through routing is unnecessary in switches
  - RapidIO transmitter- and receiver-controlled flow control perform nearly equally
  - The straightforward partitioning method provides the lowest latency but the least efficient use of resources
  - The staggered (by-board) partitioning method is very efficient but has very high latency
  - The pipelined method is a compromise
Figure: CPI latencies; latency (ms, 0-1536) vs. number of ranges (32000-64000) for Straightforward with 7 boards, Staggered with 5 boards, and Pipelined with 6 and 7 boards
Slide 13

Experimental Baseline Setup
- Simulation parameters
  - Systems use 250 MHz, 16-bit RapidIO links
  - Central-memory, store-and-forward switches
  - For other parameters, see the Appendix
- SAR image sizes
  - Most simulations are run using much smaller images to keep simulation runtime manageable (a 16 s CPI is a LOT to simulate)
  - The standard simulated image size is 2048 x 2048
  - Other sizes run include 4096 x 4096 and 16384 x 16384
- Full-sized simulation runs verify that performance scales linearly because the image is broken into "chunks"
  - It doesn't matter whether you simulate 500 chunks or 500,000 chunks, as long as the chunk size is consistent, because the simulation is doing the same thing over and over
- To determine the performance of a "real" system, simply multiply the simulated CPI latency by the image-size ratio:
  CPI Latency (actual) = CPI Latency (simulated) × (desired image size ÷ simulated image size)
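The scaling rule above can be sanity-checked in a few lines against the Message Passing CPI latency reported on the next slide (the helper function is our sketch; latencies are in simulator time units):

```python
# Predict full-size CPI latency from a scaled-down simulation using
# the linear-scaling rule: actual = simulated * (desired size / simulated size).

def predict_cpi_latency(simulated_latency, sim_dim, full_dim):
    """Scale a simulated CPI latency to a larger square image size."""
    scale = (full_dim * full_dim) / (sim_dim * sim_dim)
    return simulated_latency * scale

# 2k x 2k simulated CPI latency (Message Passing, 16 nodes, 128 KB chunks)
sim = 213_671_662
predicted = predict_cpi_latency(sim, 2048, 16384)  # 16k image = 64x the data
actual = 13_767_812_846                            # from the full-size run
error = (actual - predicted) / actual
# predicted = 13,674,986,368; error ~ 0.67% -- essentially negligible
```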
Slide 14

Scalability of SAR Results
- The table below displays the computation of predicted performance for a 16k x 16k image size based on the 2k x 2k image size
- Numbers shown are a sample of the metrics reported from simulation
- Error in the approximation is basically negligible; the most important metric is CPI completion latency
- Enormous time savings by simulating scaled-down processing loads

Message Passing, 16 nodes, 128 KB chunks:

Metric | 2k x 2k actual | 16k x 16k actual | 16k x 16k predicted | % error
CPI completion latency | 213,671,662 | 13,767,812,846 | 13,674,986,368 | 0.006742282
Request: Sum of Delay | 173,389,298,614 | 11,106,406,501,800 | 11,096,915,111,296 | 0.000854587
Request: Sum of Bytes | 333,710,512 | 21,357,397,168 | 21,357,472,768 | -0.000003540
Request: Number of Transactions | 1,245,244 | 79,691,839 | 79,695,616 | -0.000047433
Response: Sum of Delay | 361,428,208 | 22,798,782,846 | 23,131,405,312 | -0.014589483
Response: Sum of Bytes | 14,942,928 | 956,302,032 | 956,347,392 | -0.000047433
Response: Number of Transactions | 1,245,244 | 79,691,836 | 79,695,616 | -0.000047433
Slide 15

Scalability of SAR Results (continued)
- Both the Logical I/O and MP layers produce predictable, linearly scalable results
- Request/response statistic predictions are slightly less accurate for Logical I/O systems
  - It will be shown that Logical I/O is more susceptible to network contention, resulting in a marginal reduction in prediction accuracy

Logical I/O, 16 nodes, 128 KB chunks:

Metric | 2k x 2k actual | 16k x 16k actual | 16k x 16k predicted | % error
CPI completion latency | 187,221,141 | 11,923,995,014 | 11,982,153,024 | -0.004877393
Request: Sum of Delay | 193,415,004,394 | 12,425,814,223,000 | 12,378,560,281,216 | 0.003802885
Request: Sum of Bytes | 170,917,888 | 10,938,744,832 | 10,938,744,832 | 0.000000000
Request: Number of Transactions | 1,245,184 | 79,691,776 | 79,691,776 | 0.000000000
Response: Sum of Delay | 54,130,881,314 | 2,739,210,856,070 | 3,464,376,404,096 | -0.264735205
Response: Sum of Bytes | 175,636,480 | 11,240,734,720 | 11,240,734,720 | 0.000000000
Response: Number of Transactions | 655,360 | 41,943,040 | 41,943,040 | 0.000000000
Slide 16

SAR Results: Logical I/O vs. Message Passing Logical Layer Performance
- The message passing logical layer has much higher overhead
  - All operations have responses
  - CPI latency is constant as chunk size increases
  - Low levels of contention in the network for all cases
- The Logical I/O layer is inherently better for the current mapping
  - Approximately 1/2 of operations are writes, which require no response
  - Contention increases as chunk size increases (see next slide)
Figures: CPI completion latency vs. chunk size (optimized), 170-220 ms over per-processor chunk sizes of 64-1024 KB, for MP and IO; parallel efficiency vs. chunk size, 0.65-0.9 over the same chunk sizes, for MP and IO
Slide 17

Logical I/O Performance
- With the message-passing model, the "smart" GM board handles arbitration of data to processors
  - The GM sends to processors when it is ready (~16 processor nodes vs. 4 GM RIO ports, so the GM is the bottleneck and controls the traffic)
- With the I/O model, all processors start issuing "reads" to the GM when they want data
  - This floods the network with read requests
  - Contention increases as chunk size increases (see the switch memory histogram)
  - The histogram shows the switch spending most of its time with low free memory
- Potential solution?
  - Add synchronization elements to the processors to avoid having everyone ask for huge chunks of data at once
  - Let N processors ask at a time, where N = number of GM nodes
  - Working on the implementation; results will be provided in an addendum to this report

Figure: Board 0 switch free-memory histogram (64 KB to 1 MB)
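The proposed synchronization amounts to a counting semaphore in front of the read path, so at most N reads are outstanding at once. Since the MLD implementation is still in progress, this is only a sketch of the idea (all names are ours):

```python
# Sketch of the proposed read-throttling: at most N processors
# (N = number of global-memory nodes) may have an outstanding
# read to GM at any time, so the network is not flooded.
import threading

class GMReadThrottle:
    def __init__(self, n_gm_nodes):
        self._sem = threading.Semaphore(n_gm_nodes)

    def read_chunk(self, do_read, offset, size):
        with self._sem:  # blocks if N reads are already in flight
            return do_read(offset, size)
```

With 4 GM RapidIO ports, GMReadThrottle(4) would keep at most four of the ~16 processors reading at once instead of all of them asking for huge chunks simultaneously.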
Slide 18

SAR Results: Cut-Through vs. Store-and-Forward Routing
- Similar to GMTI, adding cut-through routing capabilities to the switches does not improve overall system performance
- The efficiency chart shows no major benefit from using cut-through routing
- The chart covers the Message Passing and Logical I/O layers, each with cut-through vs. store-and-forward routing
Figure: parallel efficiency (0-0.9) vs. per-processor chunk size (64-1024 KB) for MP cut-through, MP store-and-forward, IO cut-through, and IO store-and-forward
Slide 19

SAR Results: Comparison with GMTI
- SAR simulations are all run on the same backplane as GMTI
  - Comparable metrics between the two gauge system fitness for both algorithms
  - Side note: Logical I/O scales slightly better for SAR!
- The parallel efficiency graphs reveal that SAR runs more efficiently than GMTI on the same system
  - The SAR algorithm is very memory- and compute-intensive, but does not stress the network as much as GMTI

Figures: SAR parallel efficiency vs. system size (12-28 nodes, efficiency 0-1, MP and IO); GMTI parallel efficiency vs. data size (32000-64000 ranges, efficiency 0-1, Straightforward, 5 boards)
Slide 20

SAR Results: Double-Buffering
- Double-buffering allows reception of one "chunk" while processing is performed on the previous chunk (using the Logical I/O layer)
- Depending on the board architecture, it requires 2-3x more memory available on-board
- Early double-buffering experiments show CPI latency improvements for smaller chunk sizes (due to communication/computation overlap)
- However, double-buffering increases system contention as chunk sizes grow, even more than the standard Logical I/O SAR application
- It is possible that certain (more compute-heavy) phases of the algorithm will benefit more from double-buffering, while others should be left single-buffered (we will explore this possibility further)
- Once algorithm scaling to 16k x 16k is taken into account, almost 0.5 s of latency can be saved by double-buffering
Figure: CPI completion latency (165-210 ms) vs. per-processor chunk size (64-1024 KB) for double-buffered (2xBuffer) and single-buffered (1xBuffer) SAR
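The communication/computation overlap that double-buffering provides can be illustrated with a prefetch thread and two in-flight buffers. This is a schematic of the technique only, not the MLD model (all names are ours):

```python
# Double-buffering sketch: fetch chunk i+1 while computing chunk i.
# With single buffering each chunk costs read + compute time; with
# double buffering it costs ~max(read, compute) once the pipe fills.
from concurrent.futures import ThreadPoolExecutor

def run_single_buffered(chunks, read, compute):
    return [compute(read(c)) for c in chunks]

def run_double_buffered(chunks, read, compute):
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read, chunks[0])
        for nxt in chunks[1:]:
            data = pending.result()
            pending = pool.submit(read, nxt)  # prefetch the next chunk
            results.append(compute(data))     # overlaps with the fetch
        results.append(compute(pending.result()))
    return results
```

Both variants produce identical results; the double-buffered one simply hides read latency behind computation, at the cost of a second chunk buffer, matching the 2-3x memory overhead noted above.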
Slide 21
SAR/RIO Demo
DEMO
Slide 22

Conclusions
- RapidIO simulation and modeling capabilities extended
  - SAR algorithm simulated, analyzed, and compared with GMTI
  - Logical I/O layer added; GMTI model and results updated
- Altia-MLD interface developed, documented, and demonstrated
  - Small, flexible set of generic MLD primitives to easily interface with Altia from any simulation
  - Minimal components required to interface MLD with Altia; seamless integration and flexible designs demonstrated using our Altia component library
- SAR's network requirements easily handled by networks designed for GMTI
  - The biggest challenge of SAR is its memory requirement
  - Could benefit from further in-depth study of processor/memory architecture issues
- RapidIO Logical I/O layer found to benefit SAR CPI completion latency through response-less write operations
  - Well suited to the distributed-memory approach
- Double-buffering significantly improves performance of the SAR application
  - Increases local memory requirements by 2-3x
  - Smart use of double-buffering only on tasks with heavy computation may optimize the benefit
- Cut-through routing does not greatly benefit SAR (similar to GMTI)
  - Much of the overall delay is found in either port contention or simply "waiting your turn," for example in an orchestrated many-to-one send
- RIO testbed facilities in development
  - Xilinx cores and hardware recently acquired
  - Looking into other options (comments welcome)
Slide 23

Future Work and Project Options (1)
- Follow-on to current RapidIO work
  - Explore additional options for double-buffering of SAR
  - Explore synchronization options for the Logical I/O layer to improve SAR performance
  - Produce a "day in the life of SBR" simulation (with SAR and GMTI)
  - Develop a GMTI-specific Altia GUI (the current one is SAR-specific)
  - Develop a "day in the life of SBR" GUI
  - Explore Logical I/O simulations for GMTI
  - Develop additional partitionings of SAR
    - Pipelined? Staggered chunks?
Slide 24

Future Work and Project Options (2)
- Additional RapidIO projects
  - Examine fault-tolerance aspects of Honeywell's RapidIO, including failures within chips (both endpoints and switches) and links
    - Completely redundant network vs. graceful degradation with redundant links
  - Include a higher level of fidelity in the system boards, memory, processors, etc., especially regarding other system software such as operating systems
  - Model other applications, possibly including compression
  - Study the RIO multicast spec and alternatives
- Other projects
  - FPGA interconnectivity, control, configuration management, and fault detection and correction in satellite systems
  - Investigation of architecture tradeoffs of the next-generation SBC
  - High-level networking issues in the Wireless Reconfigurable Interconnects project
  - ST-9 project: constellations of satellites flying in formation
    - Study algorithms for communication/cooperative processing across satellites
  - Others?
Slide 25

Future Collaboration Possibilities
- I/UCRC
- Air Force Research Lab
  - Munitions Directorate (Eglin)
  - Space Vehicles Directorate (?)
- Internships
  - David interested in a summer internship
- Other options?
Slide 26

RACEway and RACE++
- RACEway: open standard; RACE++: Mercury Computer Systems' second generation of RACEway technology
- A "legacy" switched interconnect option
- Nodes connected via RACE/RACE++ crossbar switches
  - RACEway: 6-port; RACE++: 8-port
- Scalability
  - RACEway: up to 1000 nodes; RACE++: over 4000 nodes
- Adaptive routing
  - RACE: can be implemented on 2 of the 6 crossbar ports
  - RACE++: can be implemented on all 8 crossbar ports
- Active backplane
  - Failure of a single crossbar will often result in the failure of an entire …

Bandwidth | RACEway | RACE++
Port-to-port | 160 MB/s | 267 MB/s
Crossbar | 480 MB/s | 1 GB/s
Slide 27

Appendix: Baseline Simulation Parameters
- Store-and-forward routing
- 250 MHz DDR RIO links
- 16-bit RIO links
- Endpoint input/output queue length = 8 RIO packets
- Endpoint priority 0 threshold = 4 packets
  - An endpoint retries priority 0 packets if it has more than 4 packets in its buffer
- Other endpoint priority thresholds = 5, 6, 7 (for priorities 1, 2, 3, respectively)
- Maximum payload size = 256 bytes
- Packet disassembly delay = 14 ns
- Response creation delay = 12 ns
- Responses upgraded 1 level of priority
- Switch priority 0 threshold = 3000 bytes
  - A switch retries priority 0 packets if it has less than 3000 bytes of free memory
- Other switch priority thresholds = 2000, 1000, 0 (for priorities 1, 2, 3, respectively)
- TDM window size = 64 ns
- TDM data copied per window = 64 bytes
- TDM minimum delay = 16 ns
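For reference, the baseline parameters above can be gathered into a single configuration table (the key names are our own shorthand, not identifiers from the MLD models):

```python
# Baseline simulation parameters from the appendix, collected into one
# dictionary (key names are ours, not the MLD models').
BASELINE = {
    "routing": "store-and-forward",
    "link_clock_mhz": 250,           # DDR RIO links
    "link_width_bits": 16,
    "endpoint_queue_len_pkts": 8,
    # Retry priority-p packets when more than this many packets buffered
    "endpoint_prio_thresholds_pkts": {0: 4, 1: 5, 2: 6, 3: 7},
    "max_payload_bytes": 256,
    "packet_disassembly_delay_ns": 14,
    "response_creation_delay_ns": 12,
    "response_priority_upgrade": 1,
    # Retry priority-p packets when free memory drops below this (bytes)
    "switch_prio_thresholds_bytes": {0: 3000, 1: 2000, 2: 1000, 3: 0},
    "tdm_window_ns": 64,
    "tdm_bytes_per_window": 64,
    "tdm_min_delay_ns": 16,
}
```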