Scientific Computing in Space Using COTS Processors
Roger Sowada
Honeywell DSES
roger.j. [email protected]
Jeremy Ramos
Honeywell [email protected]
David Lupia
Honeywell [email protected]
Ramos 2 150/MAPLD 2005
Agenda
- Introduction
- Background
- Detailed Description
- Implementation Approach
- Development Efforts
- Acknowledgements
Acknowledgements
- University of Florida: key contributors to the software prototype effort and research (Alan George and the High-performance Computing and Simulation Lab)
- Physical Sciences Inc.: SEU sensor provider (Gary Galica and Robin Cox)
- WW Technologies Inc.: RPI Middleware provider (Chris Walters and technical staff)
- NASA New Millennium Program: program sponsor
Processing Platforms for New Science
- The success of recent rover missions is a perfect example of the type of science we want to support.
- Though returns from rover missions are significant, they could be orders of magnitude greater with sufficient autonomy and on-board processing capability.
- Similarly, deep space probes as well as Earth-orbiting instruments can benefit from increases in on-board processing capability.
- In all cases, increases in science data returns are dependent on the capabilities of the spacecraft's processing platform.
Payload Processing Conceptual Model
[Figure: conceptual pipeline from a sensor array through Time Dependent Processing (TDP), Object Dependent Processing (ODP), and Mission Dependent Processing (MDP) to low-bandwidth telemetry. TDP performs sample-level signal processing at the highest data rates (up to ~10,000 Mbps) and lowest algorithm complexity (~10 MIPS/MOPS); ODP performs frame-level signal processing at medium rates and complexity; MDP performs high-level logic operations at the lowest data rates (~1 Mbps) and highest algorithm complexity (~100,000 MIPS/MOPS). Data rates fall and algorithm complexity/abstraction rises along the pipeline.]
Technology Advance

A spacecraft onboard payload data processing system architecture, including a software framework and set of fault tolerance techniques, which provides:
A. An architecture and methodology that enables COTS based, high performance, scalable, multi-computer systems, incorporating reconfigurable co-processors, and supporting parallel/distributed processing for science codes, that accommodates future COTS parts/standards through upgrades.
B. An application software development and runtime environment that is familiar to science application developers, and facilitates porting of applications from the laboratory to the spacecraft payload data processor.
C. An autonomous and adaptive controller for fault tolerance configuration, responsive to environment, application criticality and system mode, that maintains required dependability and availability while optimizing resource utilization and system efficiency.
D. Methods and tools which allow the prediction of the system’s behavior in the space environment, including: predictions of availability, dependability, fault rates/types, and system level performance.
Radiation Environments
- Traditionally, microelectronics have been designed and manufactured specifically for use in radiation environments.
- Some COTS microelectronic manufacturing processes yield components that are partly resistant to radiation effects (tolerant to TID and latch-up immune).
- In most cases, Single Event Effects are of greatest concern, resulting mostly in bit flips (SEUs) and functional interrupts (SEFIs).
Natural Radiation

[Figure: simulated upset rate (upsets per bit-day, 1e-12 to 1e-4) as a function of orbit location (with precession), with separate curves for heavy ion upsets, proton upsets, and total upsets.]

Discrete simulation over 7 orbits of a Xilinx Virtex-II FPGA shows a trend driven by changes in particle flux. Orbit: 300 km perigee, 1400 km apogee, 70° inclination.
N-Modular Redundancy
- The popular approach for mitigating SEUs is to employ fixed component-level redundancy.
- This technique can be applied at all levels of the system hierarchy, from circuit to box.
- One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity.

[Figure: Modules 1, 2, and 3 feeding a majority voter.]

Example of N-modular redundancy: TMR (Triple Modular Redundancy), typically used in COTS-based microprocessor and Xilinx FPGA-based reconfigurable designs.
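The TMR majority voter described above can be sketched in a few lines. This is an illustrative software sketch, not the flight design; in an FPGA the same logic would be instantiated per bit in hardware.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant module outputs.

    For each bit position, the output is the value held by at least
    two of the three modules, so any single-module upset is masked.
    """
    return (a & b) | (b & c) | (a & c)

# A single-event upset flips one bit in one replica; the voter masks it.
golden = 0b1011_0010
upset = golden ^ 0b0000_0100   # bit flip in replica b
assert majority_vote(golden, upset, golden) == golden
```

Note the cost this slide points out: three modules run continuously to deliver the throughput of one, which is the unrealized capacity EAFTC tries to recover.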
Adaptive Fault Tolerance
- Current COTS-based space computing/electronics systems use fixed-architecture designs based on brute-force, worst-case fault masking techniques. Triple Modular Redundancy (TMR) is typically a hard-wired design approach for rad-tolerant G4 PPC processors and Xilinx FPGAs.
- The effectiveness and performance (MIPS/W) gains that the COTS device brings are degraded substantially by the use of a fixed-design, worst-case redundancy scheme.
- EAFTC enables the computer subsystem to take advantage of changing orbital environments over the mission life, utilizing the COTS processing elements more efficiently as the environment allows. This allows the EAFTC system to adaptively trade performance versus reliability in real time.

[Figure: an EAFTC-based system combines software-implemented FT, COTS processing components in a reconfigurable architecture, environmental sensors (radiation, position), and adaptive control algorithms.]
EAFTC Operational Scenario

[Figure: MIPS per Watt and SEU rate versus orbit position, comparing the MIPS/Watt of a worst-case design against the average MIPS/Watt of an EAFTC design.]

- EAFTC exploits the relation between SEU rate and orbit position, as well as the variable criticality of system tasks.
- The fundamental process implemented in the system consists of three steps:
  1. Measure the environment and system state.
  2. Assess the environmental threat to the applications' availability.
  3. Adapt the processing applications' configuration (i.e., fault tolerance) to effectively mitigate the threat presented by the environment.
- On average, more computation can be performed using EAFTC with less energy.
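The measure/assess/adapt loop above can be sketched as a simple policy function. The thresholds, the function name, and the three redundancy levels are illustrative assumptions for this sketch, not the flight controller's actual policy.

```python
def select_replication(seu_rate_per_bit_day: float, criticality: str) -> int:
    """Map the measured SEU threat and task criticality to a redundancy
    level: 3 = TMR (mask any single fault), 2 = duplex (detect faults),
    1 = simplex (maximum throughput in a benign environment).

    Thresholds here are hypothetical placeholders.
    """
    if criticality == "high" or seu_rate_per_bit_day > 1e-5:
        return 3   # e.g. high flux inside the South Atlantic Anomaly
    if seu_rate_per_bit_day > 1e-7:
        return 2
    return 1

# As the orbit passes through regions of different particle flux,
# the configuration adapts instead of staying at worst-case TMR:
assert select_replication(1e-4, "low") == 3
assert select_replication(1e-6, "low") == 2
assert select_replication(1e-9, "low") == 1
```

Averaged over an orbit, most tasks run at less than triple redundancy, which is where the energy and throughput gains claimed on this slide come from.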
Hardware Architecture

System Controller
- Controller for the APC cluster
- Hosts the EAFTC controller software and other experiment-related control software
- RadHard processor and interfaces for reliable control of the COTS cluster

APC Cluster
- Consists of several APC nodes networked together with RapidIO

Adaptive Processing Computer (APC)
- Reconfigurable processing node with multiple modes/configurations
- High-performance COTS processor (PPC)
- RapidIO network interface
- Reconfigurable co-processor

SEU Alarm
- Provides a measure of SEU-inducing flux and particle energy
- Used by the EAFTC controller to determine the real-time SEU threat level
- Separate heavy ion and proton sensors

[Figure: SEU alarm block diagram — proton, ion, and COTS proton scintillators with photo detectors, each feeding alarm analog/digital electronics with an output threshold; an SSM controller FPGA with control and data paths; power (3.3 V, 1.5 V, ±12 V); thermistor; SSIO; and a cPCI connector.]

[Figure: system block diagram — redundant System Controllers A and B and Data Processors 1..N on redundant networks A and B, with spacecraft interfaces, instruments, and mission-specific devices. Each data processor contains a processor controller, a 750FX PowerPC, an FPGA co-processor, boot and system memory, a high-speed network interface with N ports, and an I/O interface.]
Adaptive Processing Computer Conceptual Block Diagram

[Figure: block diagram — processor controller; PowerPC; 512 KB boot memory; 128 MB reprogrammable non-volatile memory with EDAC; 1 GB RAM (error correction with scrubbing); clock generation; reset generation; power detection and control; co-processor FPGA; UART; SSIO; discretes; current sensor; temperature sensor; 3-port high-speed network switch; network interface; 32-bit PCI; health and status; JTAG port; external reset.]
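The RAM in this diagram uses error correction with scrubbing: memory is periodically read and rewritten so single-bit upsets are corrected before a second upset can accumulate in the same word. A minimal Hamming(7,4) single-error-correcting sketch illustrates the principle; a flight EDAC would use a wider SECDED code over the real memory word, and this code layout is just the textbook one.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(code: int) -> int:
    """One scrub pass: recompute the syndrome and flip the erroneous bit."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)         # = error position, or 0
    if syndrome:
        code ^= 1 << (syndrome - 1)
    return code

def hamming74_data(code: int) -> int:
    """Extract the 4 data bits from a (corrected) codeword."""
    bits = [(code >> i) & 1 for i in range(7)]
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

word = hamming74_encode(0b1010)
upset = word ^ (1 << 5)                  # an SEU flips one codeword bit
assert hamming74_correct(upset) == word  # the scrub pass restores it
assert hamming74_data(word) == 0b1010
```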
EAFTC Application Platform

[Figure: layered software architecture spanning the System Controller and Data Processor. Application layer (application-specific): scientific application, application-specific FT, mission-specific FT control applications, Application Programming Interface (API), and policies/configuration parameters. FT Middleware layer (generic fault-tolerant framework): FT manager, EAFTC controller, job manager, local management agents, replication services, fault detection, FT library, and co-processor library. OS/hardware-specific layer: OS, System Abstraction Layer (SAL), network, hardware, and FPGA.]
EAFTC Middleware

Provides a high-performance platform for parallel/distributed applications:
- Cluster and job management to provide a single system view to the application
- Message Passing Interface API
- Platform abstraction covering OS system calls and hardware registers
- Mission-level customization through policies
- Scalable architecture to support clustering of resources on a multi-computer system
- Reconfigurable co-processor devices for application acceleration

Provides a high-availability platform for applications:
- An autonomous and adaptive controller for fault tolerance configuration that maintains required dependability and availability while optimizing resource utilization and system efficiency
- Checkpoint and rollback service for application recovery in the event of a fault
- Application-level replication services to facilitate reliable deployment of applications on SEU-susceptible COTS processing resources

EAFTC Middleware offers numerous benefits as a system platform:
- Capitalizes on cost savings from the use of commercial hardware
- Capitalizes on the latest processing technology through technology refresh
- Reduces cost and extends system life through a software-based middleware solution
- Scales to meet system requirements
- Customizable degree of fault tolerance to meet specific system needs
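The checkpoint-and-rollback service listed above can be illustrated with a minimal sketch. The class and method names are hypothetical, chosen only to show the pattern; the actual EAFTC service would checkpoint process state across nodes via the middleware, not an in-process copy.

```python
import copy

class CheckpointedTask:
    """Minimal checkpoint/rollback sketch: application state is saved at
    known-good points and restored when a fault is detected."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save a deep copy of the current (verified) state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the possibly corrupted state and resume from the checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

task = CheckpointedTask({"step": 0, "partial": []})
task.state["step"] = 5
task.checkpoint()          # save known-good progress
task.state["step"] = 99    # progress corrupted by a simulated SEU
task.rollback()
assert task.state["step"] == 5
```

The trade-off the middleware manages is checkpoint frequency: frequent checkpoints shrink the recomputation window after a fault but add steady-state overhead.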
EAFTC Software Architecture

[Figure: software deployment diagram — the System Controller runs the VxWorks OS, network stack, and drivers; DMS, CMS, AMS, and RDB; and JM, FTM, and ESM. Each Data Processor (with FPGA co-processor) runs the Linux OS and drivers; DMS, CMS, AMS, and RDB; RS and FCPS; and application processes using CR, MPI, JMA, and FTMA. Nodes communicate over the network and sideband signals. Components are color-coded as mission-specific, EAFTC-specific, Self Reliant, platform, and application components, configured by mission-specific parameters.]

Acronyms:
- ESM: Environmental Sensor Monitor
- JM: Job Management
- FTM: Fault Tolerance Manager
- MPI: Message Passing Interface
- FCPS: FPGA Co-Processor Services
- CR: Checkpoint and Rollback
- CMS: Cluster Management Services
- AMS: Availability Management Services
- DMS: Distributed Messaging Services
- RDB: Replicated Database
EAFTC Software Components Collaboration

[Figure: two collaboration diagrams. In each, JM, FTM, and ESM on the System Controller coordinate through SR with JMA agents on the data processors. In the first, replicated processes P1.1, P1.2, and P1.3 each run RS and FTMA; in the second, parallel tasks T1, T2, and T3 each run MPI and FTMA.]
EAFTC Technology Advances to TRL7 Flight Experiment

[Figure: validation testbeds of increasing fidelity and capability across TRL4 through TRL7. The early testbed is a VME chassis with a Ganymede SBC system controller (VxWorks, VISA, HRSC driver, EAFTC FT controller, HA middleware), a Raptor-DX SBC data processor (Yellow Dog Linux 2.4, HA middleware, FT node, benchmark application), an HRSC reconfigurable-computing processor, HSBC data processors 2-4 (SEU alarm, VxWorks, VISA, WWTG middleware components, benchmark application, RapidIO network stack), a 6-port Ethernet switch, and a development workstation. A later testbed is a cPCI chassis with power instrumentation containing a Ganymede system controller (~150 MIPS), Data Processors 1 and 2 (Motorola SBCs with FPGA PMCs, ~10,000 MIPS each), Data Processors 3 and 4 (Motorola SBCs, ~1,500 MIPS each), a Gigabit Ethernet switch (1 Gbps per link), and an experiment controller and data collection workstation on a 100 Mbps instrumentation bus.]

TRL4 Validation: demonstrated basic EAFTC technologies in a laboratory environment on a COTS hardware testbed, including a radiation source and sensor: environment sensor, alert generator, high-availability middleware, and replication services.

(NASA subsequently added a requirement for fault-tolerant cluster and MPI capability.)

TRL5 Validation: demonstrate basic EAFTC technologies in a laboratory environment on testbed hardware with partially integrated fault tolerance services; develop predictive models; validate and refine predictive models and model parameters with experiment data; run a partial set of canonical fault injection experiments.

TRL6 Validation: demonstrate enhanced EAFTC technologies in a laboratory environment on prototype flight hardware, including exposure to a radiation beam; validate and refine predictive models and model parameters with experiment data; run the complete set of canonical fault injection experiments.

TRL7 Validation: demonstrate EAFTC technologies in a real space environment; validate predictive models and model parameters with experiment data. TRL7 experiments will be identical to those performed and rung out during the TRL6 demonstration and validation.
EAFTC Model Flow

[Figure: model chain — a radiation effects model feeds a hardware SEU susceptibility model, which feeds a canonical fault model; the canonical fault types then feed availability/reliability models and a performance model, which produce delivered throughput, delivered throughput density, and effective system utilization.]

Radiation effects model inputs: orbit; epoch; radiation characterization of components; system architecture; hardware architecture. Outputs: particle fluxes, energies, and component SEE effects.

Canonical fault model inputs: decomposed hardware architecture; comprehensive fault model. Outputs: fault rates for each fault type (n) in the canonical fault model.

Availability and reliability model inputs: probability that a fault affects the application; detection coverage for each fault/error type in the canonical model; recovery coverage for each fault/error type in the canonical fault model; detection and recovery latencies for each fault; number of mode change types and rates; time to effect a mode change; probability that a mode change is successful.

Performance model inputs: mission application characterization and constraints; peak throughput per CPU; number of nodes in the cluster; algorithm/architecture coupling efficiency for the application; network-level parallelization efficiency; measured OS and FT services overhead; measured execution times for applications.
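The performance-model inputs listed above combine multiplicatively in the simplest case. This back-of-the-envelope sketch is an illustrative reading of those inputs, not the actual EAFTC model; every number in the example is an assumption.

```python
def delivered_throughput(peak_mips_per_cpu: float, nodes: int,
                         coupling_eff: float, parallel_eff: float,
                         overhead_frac: float) -> float:
    """Delivered MIPS = peak throughput times node count, derated by
    algorithm/architecture coupling efficiency, network-level
    parallelization efficiency, and measured OS/FT-services overhead."""
    return (peak_mips_per_cpu * nodes * coupling_eff * parallel_eff
            * (1.0 - overhead_frac))

# Hypothetical cluster: 4 nodes at 1000 peak MIPS each, 80% coupling
# efficiency, 90% parallelization efficiency, 10% services overhead.
mips = delivered_throughput(1000, 4, 0.80, 0.90, 0.10)
assert round(mips) == 2592
```

Dividing such a figure by measured power gives the delivered throughput density (MIPS/W) that the adaptive controller is trying to maximize.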
TRL4 EAFTC System Technology Demonstration
- Successful demonstration of the EAFTC system.
- The EAFTC prototype comprises key technology elements: cluster computer, autonomous controller, and replication services.
- Environment input is simulated via SPENVIS radiation models.
- Instrumentation for power utilization is included in the model.
- Profiling is integrated on the Data Processors for CPU utilization measurement.
- Workload is provided via a synthetic benchmark application on the Data Processors.

[Figure: TRL4 testbed deployment — a Ganymede SBC system controller (VxWorks, VISA, HRSC driver, EAFTC controller, FT controller, messaging middleware) and a Raptor-DX SBC data processor (Yellow Dog Linux 2.4, RP middleware, messaging middleware, FT node, benchmark application) in a VME chassis with an HRSC reconfigurable-computing processor; HSBC Data Processors 2-4 (SEU alarm, VxWorks, VISA, WWTG middleware components, benchmark application, RapidIO network stack) on a cPCI backplane; a 6-port Ethernet switch; and a development workstation in a Compact PCI chassis.]
Computer Capacity Experiment

TMR 3-node system:
- average power: 72 W
- average system effective MIPS: 973
- average system efficiency: 13 MIPS/W

EAFTC 4-node system:
- average power: 97 W
- average system effective MIPS: 2661
- average system efficiency: 28 MIPS/W

Comparison: 35% increase in power consumption, 173% increase in effective MIPS, and 115% increase in efficiency.
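The power and throughput deltas follow directly from the measured averages. One caveat worth noting: computed from the unrounded averages, the efficiency gain comes out near 103%; the 115% figure on the slide follows from the rounded 13 and 28 MIPS/W values.

```python
# Reported averages from the capacity experiment
tmr_power, tmr_mips = 72.0, 973.0        # TMR 3-node system
eaftc_power, eaftc_mips = 97.0, 2661.0   # EAFTC 4-node system

power_pct = (eaftc_power / tmr_power - 1) * 100
mips_pct = (eaftc_mips / tmr_mips - 1) * 100
eff_pct = ((eaftc_mips / eaftc_power) / (tmr_mips / tmr_power) - 1) * 100

assert round(power_pct) == 35    # 35% more power
assert round(mips_pct) == 173    # 173% more effective MIPS
assert round(eff_pct) == 103     # efficiency gain from unrounded averages
```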
TRL5 Platform
- Consists of 4 Data Processors implemented with COTS Single Board Computers (SBCs) and PCI Mezzanine Cards (PMCs).
- The SBCs will implement a PPC 750FX microprocessor running the Linux operating system and a software fault injector for fault simulation.
- The PMCs will implement a Xilinx Virtex-II FPGA that will serve as the co-processor for its host SBC.
- The System Controller will be implemented with a software development unit of our flight SBC.
- All nodes in the cluster will be interconnected via a Gigabit Ethernet switch.
- A Development Workstation will be used for software development, experiment control, and instrumentation data collection.
- Software Implemented Fault Injection (SWIFI) will be the primary method for simulating faults. Other methods may be used, such as manual node resets, network traffic fault injections (via software or hardware fault injection methods), and test-port-inserted faults.

[Figure: TRL5 testbed — development workstation (payload controller instrumentation), system controller, and Data Processors 1-4.]
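The core of a SWIFI campaign is emulating an SEU by corrupting state in software. A minimal sketch of such an injector follows; the function name and the bytearray memory model are illustrative assumptions, since a real injector would target live process memory or registers on the data processors.

```python
import random

def inject_bit_flip(memory: bytearray, rng: random.Random) -> int:
    """SWIFI sketch: emulate a single-event upset by flipping one
    randomly chosen bit in a memory image. Returns the byte offset
    that was corrupted, so the experiment can log the fault site."""
    offset = rng.randrange(len(memory))
    bit = rng.randrange(8)
    memory[offset] ^= 1 << bit
    return offset

rng = random.Random(0)            # fixed seed for a repeatable experiment
image = bytearray(b"\x00" * 64)
where = inject_bit_flip(image, rng)
assert sum(bin(b).count("1") for b in image) == 1   # exactly one bit flipped
assert 0 <= where < 64
```

Logging the injected fault site lets each run be scored against the detection and recovery coverage parameters fed into the availability model.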
New Millennium Program Space Technology 8
- The New Millennium Program is a NASA program for technology development, currently working on its 8th technology development program.
- The program is in the Formulation phase to evaluate 4 subsystem technologies, one of them EAFTC.
- The objective of the NMP ST8 EAFTC mission is to validate EAFTC technology at TRL7 through experimentation in space.
- Milestones: SSR 7/05; PDR 5/06 (TRL5); CDR 5/07 (TRL6); launch 12/08 (TRL7 after a 6-month on-orbit experiment).
- Our team's overall goal is to demonstrate that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance on-board payload processing. We will demonstrate that, by using EAFTC, we can maximize and significantly improve the performance of a COTS-based computer in orbit.