OPERA Software Architecture OSA - Zettaflopszettaflops.org/spc08/Crago-ISI-FTSCENT-OSA-5-08.pdf ·...
Transcript of OPERA Software Architecture OSA - Zettaflopszettaflops.org/spc08/Crago-ISI-FTSCENT-OSA-5-08.pdf ·...
-UNCLASSIFIED-
-UNCLASSIFIED-
OPERA Software Architecture
OSA
Steve CragoJanice McMahon
USC/ISI-EastMay 29, 2008
2-UNCLASSIFIED-
-UNCLASSIFIED-Outline
Introduction
OPERA Software Environment
Software Fault Tolerance for OPERA
3-UNCLASSIFIED-
-UNCLASSIFIED-Multicore Trends
x86 Multi-Core
Cell
GPU
TILE64
Power density and complexity problems have driven current and next generation processors to have multiple cores on chip
Intel Core DuoAMD8 core SUN Niagara-2Tilera TILE64
On-chip parallelism improves throughput of multiple applications, but results in programming challenges
Single applications must be parallelizedParallel applications must be scalableRequires highly skilled programmers or better tools
4-UNCLASSIFIED-
-UNCLASSIFIED-
Interprocessor Communication
Low latency connections between compute tiles expose new software issues (multi-core ≠ Symmetric Multiprocessor ≠High Performance Computer)
Traditional Multiprocessor
DRAM~100 cycles
DRAM~100 cycles~1000 cycles
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
<10 cycleslatency between
tiles
Multi-Core Architecture
5-UNCLASSIFIED-
-UNCLASSIFIED-Performance Sources
Instruction Level ParallelismFine grainedAvailable in general-purpose legacy codeExploited by traditional superscalar and VLIW uniprocessorsTypically limited by branch prediction and dependencies
Data Level ParallelismEfficient to exploitAvailable in streaming applicationsRelatively easy for programmer to specifyExploited by multimedia extensions, Cell, SIMD architectures
Thread Level ParallelismExecutes tasks in parallelNeeded for scalabilityTrickier to program or extract from programInitial focus of commercial multi-core architectures
Multi-core tools must consider these sources of parallelism simultaneously. This differentiates multi-core tools from traditional uniprocessor and multiprocessor
tools.
6-UNCLASSIFIED-
-UNCLASSIFIED-Government/Industry Multi-core Roles
IndustryHardware: stamp x86 cores on a chip (e.g. Intel, AMD)Hardware: develop aggressive multi-core chip for specific commercial markets (e.g. Tilera)Evolutionary software tools and programming supportGeneral-purpose software research and education (e.g. fund universities to teach students how to program multi-core)
Government ResponsibilitiesSpecial needs (e.g. radiation hardening)Application libraries (e.g. MPI, VSIPL)Domain-specific toolsRun-time management for dynamic scenarios and autonomous operation (resource management and fault tolerance)Longer term research
7-UNCLASSIFIED-
-UNCLASSIFIED-Outline
Introduction
OPERA Software Environment
Software Fault Tolerance for OPERA
8-UNCLASSIFIED-
-UNCLASSIFIED-
OPERA Software Architecture GoalsEnable application demonstration of OPERA processor
Provide basic functionality to allow hardware technology demonstration
Leverage commercial Tilera software baseCurrent support is for sequential compiler and library for message passing and shared memory
Provide baseline programming environment for future missions
Based on familiar programming modelsStandard-compliant APIs
Provide technology base for future software technologies
Domain-specific parallelizationDynamic resource management Software fault tolerance
9-UNCLASSIFIED-
-UNCLASSIFIED-
OPERA Software Architecture
Long-Term MulticoreSystem Software Vision
Multi-core System Software
Applications
Run-time System
Static Compiler
Application/ System Libraries
(Static and Dynamic)BinariesIntermediate representations
Programming languagesExplicit/implicit parallelism
Fault Tolerance
Resource Manager
System Utilities
Dynamic Compiler
Multicore Hardware Resources
Processing cores I/O PortsMemory
BanksNetwork
Links
Supports dynamic, fault-tolerant, and autonomous operation
ApplicationLibraries
Applications
Parallel Binary Executable
Tilera Modified Compiler
LinkerCommunication
Libraries
10-UNCLASSIFIED-
-UNCLASSIFIED-
OSA Funded TasksTool extensions
Floating point extensions: compiler, simulator, debugger, iLiblibraryLegacy software support: C++, MPI, OpenMP, VSIPL, pVSIPL++, OpenPNL
Performance and productivity toolsParallel performance analysisParallel debugRun-time monitor and run-time systemFine-grain parallel compiler
ApplicationsJPEG2KCAFCo-addOn-board processing
11-UNCLASSIFIED-
-UNCLASSIFIED-
Outline
Introduction
OPERA Software Environment
Software Fault Tolerance for OPERA
12-UNCLASSIFIED-
-UNCLASSIFIED-
Software Fault Tolerance
Rad-hard by design (RHBD) as applied to OPERA will provide protection for total dose, latchup, and single-event upsets
Rate of updates can be traded off with cost of protectionRHBD can be used in combination with system-level software techniques
Other sources of faults will still exist, e.g.Software errorPhysical damage
Software fault tolerance provides mitigation for faults while minimizing overhead and can work in conjunction with RHBD to provide overall mission reliability
13-UNCLASSIFIED-
-UNCLASSIFIED-Run-time Research
Resource management and introspectionCore allocationPower managementIntrospection hooks
Performance monitoringBehavior monitoringSupport for programming and performance tuning
Fault toleranceSupport from compiler and run-time systemLimited redundancyCheck-pointing and roll-backInteraction with resource management
14-UNCLASSIFIED-
-UNCLASSIFIED-Fault Tolerance Opportunities
Redundancy availableCoresNetworksMemory interfacesI/O
ProgrammabilityFault tolerance and performance can be tuned according to application needs
15-UNCLASSIFIED-
-UNCLASSIFIED-Fault Tolerance Challenges
Increased stateCoresNetworksMemory interfacesI/O
ProgrammabilityProgramming can be topology-dependent and tied to physical location
16-UNCLASSIFIED-
-UNCLASSIFIED-Summary
OPERA is an opportunity to provide unprecedented general-purpose performance for space
OPERA software will build on commercial software to provide an environment suitable for government applications
OPERA provides both challenges and opportunities for fault tolerance
Being addressed through both hardware and software