Scientific Computing in Space Using COTS Processors
Roger Sowada
Honeywell DSES
roger.j. [email protected]
Jeremy Ramos
Honeywell [email protected]
David Lupia
Honeywell [email protected]
Ramos 2 150/MAPLD 2005
Agenda
- Introduction
- Background
- Detailed Description
- Implementation Approach
- Development Efforts
- Acknowledgements
Acknowledgements
- University of Florida: key contributors to the software prototype effort and research (Alan George and the High-performance Computing and Simulation Lab)
- Physical Sciences Inc.: SEU sensor provider (Gary Galica and Robin Cox)
- WW Technologies Inc.: RPI Middleware provider (Chris Walters and technical staff)
- NASA New Millennium Program: program sponsor
Processing Platforms for New Science
- The success of recent rover missions is a perfect example of the type of science we want to support.
- Though returns from rover missions are significant, they could be orders of magnitude greater with sufficient autonomy and on-board processing capability.
- Similarly, deep space probes as well as Earth-orbiting instruments can benefit from increases in on-board processing capability.
- In all cases, increases in science data returns are dependent on the capabilities of the spacecraft's processing platform.
Payload Processing Conceptual Model
[Figure: conceptual pipeline from a sensor array through Time Dependent Processing (TDP), Object Dependent Processing (ODP), and Mission Dependent Processing (MDP) to low-bandwidth telemetry. TDP performs sample-level signal processing at the highest data rates (up to ~10,000 Mbps) and lowest algorithm complexity (~10 MIPS/MOPS); ODP performs frame-level signal processing at medium rates and complexity; MDP performs high-level logic operations at the lowest data rates (~1 Mbps) and highest algorithm complexity (~100,000 MIPS/MOPS). Data rates fall and algorithm complexity/abstraction rises along the pipeline.]
Technology Advance

A spacecraft onboard payload data processing system architecture, including a software framework and set of fault tolerance techniques, which provides:
A. An architecture and methodology that enables COTS based, high performance, scalable, multi-computer systems, incorporating reconfigurable co-processors, and supporting parallel/distributed processing for science codes, that accommodates future COTS parts/standards through upgrades.
B. An application software development and runtime environment that is familiar to science application developers, and facilitates porting of applications from the laboratory to the spacecraft payload data processor.
C. An autonomous and adaptive controller for fault tolerance configuration, responsive to environment, application criticality and system mode, that maintains required dependability and availability while optimizing resource utilization and system efficiency.
D. Methods and tools which allow the prediction of the system’s behavior in the space environment, including: predictions of availability, dependability, fault rates/types, and system level performance.
Radiation Environments
- Traditionally, microelectronics have been designed and manufactured specifically for use in radiation environments.
- Some COTS microelectronic manufacturing processes yield components that are partly resistant to radiation effects (tolerant to TID and latch-up immune).
- In most cases, Single Event Effects are of greatest concern, resulting mostly in bit flips (SEUs) and functional interrupts (SEFIs).
Natural Radiation

[Figure: simulated upset rate (upsets per bit-day, 1e-12 to 1e-4) as a function of orbit location (with precession), with separate curves for heavy ion upsets, proton upsets, and total upsets.]

Discrete simulation over 7 orbits of a Xilinx Virtex-II FPGA shows a trend driven by changes in particle flux. Orbit: 300 km perigee, 1400 km apogee, 70° inclination.
N-Modular Redundancy
- The popular approach for mitigating SEUs is to employ fixed component-level redundancy.
- This technique can be applied at all levels of the system hierarchy, from circuit to box.
- One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity.

[Figure: Modules 1, 2, and 3 feeding a majority voter.]

Example of N-modular redundancy: TMR (Triple Modular Redundancy), typically used in COTS-based microprocessor and Xilinx FPGA-based reconfigurable designs.
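The TMR majority voter described above can be sketched in a few lines. This is an illustrative software sketch, not the flight design; in an FPGA the same logic would be instantiated per bit in hardware.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant module outputs.

    For each bit position, the output is the value held by at least
    two of the three modules, so any single-module upset is masked.
    """
    return (a & b) | (b & c) | (a & c)

# A single-event upset flips one bit in one replica; the voter masks it.
golden = 0b1011_0010
upset = golden ^ 0b0000_0100   # bit flip in replica b
assert majority_vote(golden, upset, golden) == golden
```

Note the cost this slide points out: three modules run continuously to deliver the throughput of one, which is the unrealized capacity EAFTC tries to recover.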
Adaptive Fault Tolerance
- Current COTS-based space computing/electronics systems use fixed-architecture designs based on brute-force, worst-case fault masking techniques. Triple Modular Redundancy (TMR) is typically a hard-wired design approach for rad-tolerant G4 PPC processors and Xilinx FPGAs.
- The effectiveness and performance (MIPS/W) gains that the COTS device brings are degraded substantially by the use of a fixed-design, worst-case redundancy scheme.
- EAFTC enables the computer subsystem to take advantage of changing orbital environments over the mission life, utilizing the COTS processing elements more efficiently as the environment allows. This allows the EAFTC system to adaptively trade performance versus reliability in real time.

[Figure: an EAFTC-based system combines software-implemented FT, COTS processing components in a reconfigurable architecture, environmental sensors (radiation, position), and adaptive control algorithms.]
EAFTC Operational Scenario

[Figure: MIPS per Watt and SEU rate versus orbit position, comparing the MIPS/Watt of a worst-case design against the average MIPS/Watt of an EAFTC design.]

- EAFTC exploits the relation between SEU rate and orbit position, as well as the variable criticality of system tasks.
- The fundamental process implemented in the system consists of three steps:
  1. Measure the environment and system state.
  2. Assess the environmental threat to the applications' availability.
  3. Adapt the processing applications' configuration (i.e., fault tolerance) to effectively mitigate the threat presented by the environment.
- On average, more computation can be performed using EAFTC with less energy.
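The measure/assess/adapt loop above can be sketched as a simple policy function. The thresholds, the function name, and the three redundancy levels are illustrative assumptions for this sketch, not the flight controller's actual policy.

```python
def select_replication(seu_rate_per_bit_day: float, criticality: str) -> int:
    """Map the measured SEU threat and task criticality to a redundancy
    level: 3 = TMR (mask any single fault), 2 = duplex (detect faults),
    1 = simplex (maximum throughput in a benign environment).

    Thresholds here are hypothetical placeholders.
    """
    if criticality == "high" or seu_rate_per_bit_day > 1e-5:
        return 3   # e.g. high flux inside the South Atlantic Anomaly
    if seu_rate_per_bit_day > 1e-7:
        return 2
    return 1

# As the orbit passes through regions of different particle flux,
# the configuration adapts instead of staying at worst-case TMR:
assert select_replication(1e-4, "low") == 3
assert select_replication(1e-6, "low") == 2
assert select_replication(1e-9, "low") == 1
```

Averaged over an orbit, most tasks run at less than triple redundancy, which is where the energy and throughput gains claimed on this slide come from.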
Hardware Architecture

System Controller
- Controller for the APC cluster
- Hosts the EAFTC controller software and other experiment-related control software
- RadHard processor and interfaces for reliable control of the COTS cluster

APC Cluster
- Consists of several APC nodes networked together with RapidIO

Adaptive Processing Computer (APC)
- Reconfigurable processing node with multiple modes/configurations
- High-performance COTS processor (PPC)
- RapidIO network interface
- Reconfigurable co-processor

SEU Alarm
- Provides a measure of SEU-inducing flux and particle energy
- Used by the EAFTC controller to determine the real-time SEU threat level
- Separate heavy ion and proton sensors

[Figure: SEU alarm block diagram — proton, ion, and COTS proton scintillators with photo detectors, each feeding alarm analog/digital electronics with an output threshold; an SSM controller FPGA with control and data paths; power (3.3 V, 1.5 V, ±12 V); thermistor; SSIO; and a cPCI connector.]

[Figure: system block diagram — redundant System Controllers A and B and Data Processors 1..N on redundant networks A and B, with spacecraft interfaces, instruments, and mission-specific devices. Each data processor contains a processor controller, a 750FX PowerPC, an FPGA co-processor, boot and system memory, a high-speed network interface with N ports, and an I/O interface.]
Adaptive Processing Computer Conceptual Block Diagram

[Figure: block diagram — processor controller; PowerPC; 512 KB boot memory; 128 MB reprogrammable non-volatile memory with EDAC; 1 GB RAM (error correction with scrubbing); clock generation; reset generation; power detection and control; co-processor FPGA; UART; SSIO; discretes; current sensor; temperature sensor; 3-port high-speed network switch; network interface; 32-bit PCI; health and status; JTAG port; external reset.]
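The RAM in this diagram uses error correction with scrubbing: memory is periodically read and rewritten so single-bit upsets are corrected before a second upset can accumulate in the same word. A minimal Hamming(7,4) single-error-correcting sketch illustrates the principle; a flight EDAC would use a wider SECDED code over the real memory word, and this code layout is just the textbook one.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(code: int) -> int:
    """One scrub pass: recompute the syndrome and flip the erroneous bit."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)         # = error position, or 0
    if syndrome:
        code ^= 1 << (syndrome - 1)
    return code

def hamming74_data(code: int) -> int:
    """Extract the 4 data bits from a (corrected) codeword."""
    bits = [(code >> i) & 1 for i in range(7)]
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

word = hamming74_encode(0b1010)
upset = word ^ (1 << 5)                  # an SEU flips one codeword bit
assert hamming74_correct(upset) == word  # the scrub pass restores it
assert hamming74_data(word) == 0b1010
```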
EAFTC Application Platform

[Figure: layered software architecture spanning the System Controller and Data Processor. Application layer (application-specific): scientific application, application-specific FT, mission-specific FT control applications, Application Programming Interface (API), and policies/configuration parameters. FT Middleware layer (generic fault-tolerant framework): FT manager, EAFTC controller, job manager, local management agents, replication services, fault detection, FT library, and co-processor library. OS/hardware-specific layer: OS, System Abstraction Layer (SAL), network, hardware, and FPGA.]
EAFTC Middleware

Provides a high-performance platform for parallel/distributed applications:
- Cluster and job management to provide a single system view to the application
- Message Passing Interface API
- Platform abstraction covering OS system calls and hardware registers
- Mission-level customization through policies
- Scalable architecture to support clustering of resources on a multi-computer system
- Reconfigurable co-processor devices for application acceleration

Provides a high-availability platform for applications:
- An autonomous and adaptive controller for fault tolerance configuration that maintains required dependability and availability while optimizing resource utilization and system efficiency
- Checkpoint and rollback service for application recovery in the event of a fault
- Application-level replication services to facilitate reliable deployment of applications on SEU-susceptible COTS processing resources

EAFTC Middleware offers numerous benefits as a system platform:
- Capitalizes on cost savings from the use of commercial hardware
- Capitalizes on the latest processing technology through technology refresh
- Reduces cost and extends system life through a software-based middleware solution
- Scales to meet system requirements
- Customizable degree of fault tolerance to meet specific system needs
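The checkpoint-and-rollback service listed above can be illustrated with a minimal sketch. The class and method names are hypothetical, chosen only to show the pattern; the actual EAFTC service would checkpoint process state across nodes via the middleware, not an in-process copy.

```python
import copy

class CheckpointedTask:
    """Minimal checkpoint/rollback sketch: application state is saved at
    known-good points and restored when a fault is detected."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save a deep copy of the current (verified) state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the possibly corrupted state and resume from the checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

task = CheckpointedTask({"step": 0, "partial": []})
task.state["step"] = 5
task.checkpoint()          # save known-good progress
task.state["step"] = 99    # progress corrupted by a simulated SEU
task.rollback()
assert task.state["step"] == 5
```

The trade-off the middleware manages is checkpoint frequency: frequent checkpoints shrink the recomputation window after a fault but add steady-state overhead.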
EAFTC Software Architecture

[Figure: software deployment diagram — the System Controller runs the VxWorks OS, network stack, and drivers; DMS, CMS, AMS, and RDB; and JM, FTM, and ESM. Each Data Processor (with FPGA co-processor) runs the Linux OS and drivers; DMS, CMS, AMS, and RDB; RS and FCPS; and application processes using CR, MPI, JMA, and FTMA. Nodes communicate over the network and sideband signals. Components are color-coded as mission-specific, EAFTC-specific, Self Reliant, platform, and application components, configured by mission-specific parameters.]

Acronyms:
- ESM: Environmental Sensor Monitor
- JM: Job Management
- FTM: Fault Tolerance Manager
- MPI: Message Passing Interface
- FCPS: FPGA Co-Processor Services
- CR: Checkpoint and Rollback
- CMS: Cluster Management Services
- AMS: Availability Management Services
- DMS: Distributed Messaging Services
- RDB: Replicated Database
EAFTC Software Components Collaboration

[Figure: two collaboration diagrams. In each, JM, FTM, and ESM on the System Controller coordinate through SR with JMA agents on the data processors. In the first, replicated processes P1.1, P1.2, and P1.3 each run RS and FTMA; in the second, parallel tasks T1, T2, and T3 each run MPI and FTMA.]
EAFTC Technology Advances to TRL7 Flight Experiment

[Figure: validation testbeds of increasing fidelity and capability across TRL4 through TRL7. The early testbed is a VME chassis with a Ganymede SBC system controller (VxWorks, VISA, HRSC driver, EAFTC FT controller, HA middleware), a Raptor-DX SBC data processor (Yellow Dog Linux 2.4, HA middleware, FT node, benchmark application), an HRSC reconfigurable-computing processor, HSBC data processors 2-4 (SEU alarm, VxWorks, VISA, WWTG middleware components, benchmark application, RapidIO network stack), a 6-port Ethernet switch, and a development workstation. A later testbed is a cPCI chassis with power instrumentation containing a Ganymede system controller (~150 MIPS), Data Processors 1 and 2 (Motorola SBCs with FPGA PMCs, ~10,000 MIPS each), Data Processors 3 and 4 (Motorola SBCs, ~1,500 MIPS each), a Gigabit Ethernet switch (1 Gbps per link), and an experiment controller and data collection workstation on a 100 Mbps instrumentation bus.]

TRL4 Validation: demonstrated basic EAFTC technologies in a laboratory environment on a COTS hardware testbed, including a radiation source and sensor: environment sensor, alert generator, high-availability middleware, and replication services.

(NASA subsequently added a requirement for fault-tolerant cluster and MPI capability.)

TRL5 Validation: demonstrate basic EAFTC technologies in a laboratory environment on testbed hardware with partially integrated fault tolerance services; develop predictive models; validate and refine predictive models and model parameters with experiment data; run a partial set of canonical fault injection experiments.

TRL6 Validation: demonstrate enhanced EAFTC technologies in a laboratory environment on prototype flight hardware, including exposure to a radiation beam; validate and refine predictive models and model parameters with experiment data; run the complete set of canonical fault injection experiments.

TRL7 Validation: demonstrate EAFTC technologies in a real space environment; validate predictive models and model parameters with experiment data. TRL7 experiments will be identical to those performed and rung out during the TRL6 demonstration and validation.
EAFTC Model Flow

[Figure: model chain — a radiation effects model feeds a hardware SEU susceptibility model, which feeds a canonical fault model; the canonical fault types then feed availability/reliability models and a performance model, which produce delivered throughput, delivered throughput density, and effective system utilization.]

Radiation effects model inputs: orbit; epoch; radiation characterization of components; system architecture; hardware architecture. Outputs: particle fluxes, energies, and component SEE effects.

Canonical fault model inputs: decomposed hardware architecture; comprehensive fault model. Outputs: fault rates for each fault type (n) in the canonical fault model.

Availability and reliability model inputs: probability that a fault affects the application; detection coverage for each fault/error type in the canonical model; recovery coverage for each fault/error type in the canonical fault model; detection and recovery latencies for each fault; number of mode change types and rates; time to effect a mode change; probability that a mode change is successful.

Performance model inputs: mission application characterization and constraints; peak throughput per CPU; number of nodes in the cluster; algorithm/architecture coupling efficiency for the application; network-level parallelization efficiency; measured OS and FT services overhead; measured execution times for applications.
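The performance-model inputs listed above combine multiplicatively in the simplest case. This back-of-the-envelope sketch is an illustrative reading of those inputs, not the actual EAFTC model; every number in the example is an assumption.

```python
def delivered_throughput(peak_mips_per_cpu: float, nodes: int,
                         coupling_eff: float, parallel_eff: float,
                         overhead_frac: float) -> float:
    """Delivered MIPS = peak throughput times node count, derated by
    algorithm/architecture coupling efficiency, network-level
    parallelization efficiency, and measured OS/FT-services overhead."""
    return (peak_mips_per_cpu * nodes * coupling_eff * parallel_eff
            * (1.0 - overhead_frac))

# Hypothetical cluster: 4 nodes at 1000 peak MIPS each, 80% coupling
# efficiency, 90% parallelization efficiency, 10% services overhead.
mips = delivered_throughput(1000, 4, 0.80, 0.90, 0.10)
assert round(mips) == 2592
```

Dividing such a figure by measured power gives the delivered throughput density (MIPS/W) that the adaptive controller is trying to maximize.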
TRL4 EAFTC System Technology Demonstration
- Successful demonstration of the EAFTC system.
- The EAFTC prototype comprises key technology elements: cluster computer, autonomous controller, and replication services.
- Environment input is simulated via SPENVIS radiation models.
- Instrumentation for power utilization is included in the model.
- Profiling is integrated on the Data Processors for CPU utilization measurement.
- Workload is provided via a synthetic benchmark application on the Data Processors.

[Figure: TRL4 testbed deployment — a Ganymede SBC system controller (VxWorks, VISA, HRSC driver, EAFTC controller, FT controller, messaging middleware) and a Raptor-DX SBC data processor (Yellow Dog Linux 2.4, RP middleware, messaging middleware, FT node, benchmark application) in a VME chassis with an HRSC reconfigurable-computing processor; HSBC Data Processors 2-4 (SEU alarm, VxWorks, VISA, WWTG middleware components, benchmark application, RapidIO network stack) on a cPCI backplane; a 6-port Ethernet switch; and a development workstation in a Compact PCI chassis.]
Computer Capacity Experiment

TMR 3-node system:
- average power: 72 W
- average system effective MIPS: 973
- average system efficiency: 13 MIPS/W

EAFTC 4-node system:
- average power: 97 W
- average system effective MIPS: 2661
- average system efficiency: 28 MIPS/W

Comparison: 35% increase in power consumption, 173% increase in effective MIPS, and 115% increase in efficiency.
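The power and throughput deltas follow directly from the measured averages. One caveat worth noting: computed from the unrounded averages, the efficiency gain comes out near 103%; the 115% figure on the slide follows from the rounded 13 and 28 MIPS/W values.

```python
# Reported averages from the capacity experiment
tmr_power, tmr_mips = 72.0, 973.0        # TMR 3-node system
eaftc_power, eaftc_mips = 97.0, 2661.0   # EAFTC 4-node system

power_pct = (eaftc_power / tmr_power - 1) * 100
mips_pct = (eaftc_mips / tmr_mips - 1) * 100
eff_pct = ((eaftc_mips / eaftc_power) / (tmr_mips / tmr_power) - 1) * 100

assert round(power_pct) == 35    # 35% more power
assert round(mips_pct) == 173    # 173% more effective MIPS
assert round(eff_pct) == 103     # efficiency gain from unrounded averages
```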
TRL5 Platform
- Consists of 4 Data Processors implemented with COTS Single Board Computers (SBCs) and PCI Mezzanine Cards (PMCs).
- The SBCs will implement a PPC 750FX microprocessor running the Linux operating system and a software fault injector for fault simulation.
- The PMCs will implement a Xilinx Virtex-II FPGA that will serve as the co-processor for its host SBC.
- The System Controller will be implemented with a software development unit of our flight SBC.
- All nodes in the cluster will be interconnected via a Gigabit Ethernet switch.
- A Development Workstation will be used for software development, experiment control, and instrumentation data collection.
- Software Implemented Fault Injection (SWIFI) will be the primary method for simulating faults. Other methods may be used, such as manual node resets, network traffic fault injections (via software or hardware fault injection methods), and test-port-inserted faults.

[Figure: TRL5 testbed — development workstation (payload controller instrumentation), system controller, and Data Processors 1-4.]
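The core of a SWIFI campaign is emulating an SEU by corrupting state in software. A minimal sketch of such an injector follows; the function name and the bytearray memory model are illustrative assumptions, since a real injector would target live process memory or registers on the data processors.

```python
import random

def inject_bit_flip(memory: bytearray, rng: random.Random) -> int:
    """SWIFI sketch: emulate a single-event upset by flipping one
    randomly chosen bit in a memory image. Returns the byte offset
    that was corrupted, so the experiment can log the fault site."""
    offset = rng.randrange(len(memory))
    bit = rng.randrange(8)
    memory[offset] ^= 1 << bit
    return offset

rng = random.Random(0)            # fixed seed for a repeatable experiment
image = bytearray(b"\x00" * 64)
where = inject_bit_flip(image, rng)
assert sum(bin(b).count("1") for b in image) == 1   # exactly one bit flipped
assert 0 <= where < 64
```

Logging the injected fault site lets each run be scored against the detection and recovery coverage parameters fed into the availability model.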
New Millennium Program Space Technology 8
- The New Millennium Program is a NASA program for technology development, currently working on its 8th technology development program.
- The program is in the Formulation phase to evaluate 4 subsystem technologies, one of them EAFTC.
- The objective of the NMP ST8 EAFTC mission is to validate EAFTC technology at TRL7 through experimentation in space.
- Milestones: SSR 7/05; PDR 5/06 (TRL5); CDR 5/07 (TRL6); launch 12/08 (TRL7 after a 6-month on-orbit experiment).
- Our team's overall goal is to demonstrate that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance on-board payload processing. We will demonstrate that, by using EAFTC, we can maximize and significantly improve the performance of a COTS-based computer in orbit.