-
Slide 1
MITRE · AFRL · MIT Lincoln Laboratory · www.hpec-si.org
High Performance Embedded Computing Software Initiative (HPEC-SI)
This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Dr. Jeremy Kepner, MIT Lincoln Laboratory
-
Slide 2
Outline
• Introduction
  – DoD Need
  – Program Structure
• Software Standards
• Parallel VSIPL++
• Future Challenges
• Summary
-
Slide 3
Overview - High Performance Embedded Computing (HPEC) Initiative
[Figure: Common Imagery Processor (CIP) with Enhanced Tactical Radar Correlator (ETRAC) and ASARS-2 imagery sources, hosted on a shared-memory server and an embedded multi-processor]
[Diagram: the HPEC Software Initiative spans Applied Research (DARPA), Development, and Demonstration, feeding DoD acquisition programs]
Challenge: Transition advanced software technology and practices into major defense acquisition programs
-
Slide 4
Why Is DoD Concerned with Embedded Software?
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software, $0–$3B, from FY98 onward. Source: “HPEC Market Study,” March 2001]
• COTS acquisition practices have shifted the burden from “point design” hardware to “point design” software
• Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards
-
Slide 5
Issues with Current HPEC Development: Inadequacy of Software Practices & Standards
[Figure: representative DoD platforms relying on HPEC: NSSN, AEGIS, Rivet Joint, Standard Missile, Predator, Global Hawk, U-2, JSTARS, MSAT-Air, P-3/APS-137, F-16, MK-48 Torpedo]
Today, embedded software is:
• Not portable
• Not scalable
• Difficult to develop
• Expensive to maintain
[Timeline: system development/acquisition stages of roughly 4 years each: system technology development, system field demonstration, engineering/manufacturing development, insertion to military asset; over the same period the signal processor evolves through six generations]
• High Performance Embedded Computing is pervasive throughout DoD applications:
  – Airborne radar insertion program: 85% software rewrite for each hardware platform
  – Missile common processor: processor board costs < $100K; software development costs > $100M
  – Torpedo upgrade: two software rewrites required after changes in hardware design
-
Slide 6
Evolution of Software Support Towards “Write Once, Run Anywhere/Anysize”
[Diagram: evolution of DoD and COTS software development from 1990 (application tied directly to vendor software) through 2000 (application / middleware / vendor software layers) to 2005 (applications running on middleware over common embedded software standards)]
• Application software has traditionally been tied to the hardware
• Many acquisition programs are developing stove-piped middleware “standards”
• Open software standards can provide portability, performance, and productivity benefits
• Goal: support “Write Once, Run Anywhere/Anysize”
-
Slide 7
Overall Initiative Goals & Impact
[Diagram: HPEC Software Initiative cycle of Demonstrate, Develop, and Prototype, built on object-oriented technology, open standards, and interoperable & scalable designs; goals: Portability (3x), Productivity (3x), Performance (1.5x)]
Program Goals
• Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
• Engage acquisition community to promote technology insertion
• Deliver quantifiable benefits
Portability: reduction in lines-of-code changed to port/scale to a new system
Productivity: reduction in overall lines-of-code
Performance: computation and communication benchmarks
-
Slide 8
HPEC-SI Path to Success
Benefit to DoD Programs
• Reduces software cost & schedule
• Enables rapid COTS insertion
• Improves cross-program interoperability
• Basis for improved capabilities

Benefit to DoD Contractors
• Reduces software complexity & risk
• Easier comparisons/more competition
• Increased functionality

Benefit to Embedded Vendors
• Lower software barrier to entry
• Reduced software maintenance costs
• Evolution of open standards
The HPEC Software Initiative builds on:
• Proven technology
• Business models
• Better software practices
-
Slide 9
Organization
Demonstration: Dr. Keith Bromley (SPAWAR), Dr. Richard Games (MITRE), Dr. Jeremy Kepner (MIT/LL), Mr. Brian Sroka (MITRE), Mr. Ron Williams (MITRE), ...
Government Lead: Dr. Rich Linderman (AFRL)
Technical Advisory Board: Dr. Rich Linderman (AFRL), Dr. Richard Games (MITRE), Mr. John Grosh (OSD), Mr. Bob Graybill (DARPA/ITO), Dr. Keith Bromley (SPAWAR), Dr. Mark Richards (GTRI), Dr. Jeremy Kepner (MIT/LL)
Executive Committee: Dr. Charles Holland (PADUSD(S&T)), RADM Paul Sullivan (N77)
Development: Dr. James Lebak (MIT/LL), Dr. Mark Richards (GTRI), Mr. Dan Campbell (GTRI), Mr. Ken Cain (Mercury), Mr. Randy Judd (SPAWAR), ...
Applied Research: Mr. Bob Bond (MIT/LL), Mr. Ken Flowers (Mercury), Dr. Spaanenburg (PENTUM), Mr. Dennis Cottel (SPAWAR), Capt. Bergmann (AFRL), Dr. Tony Skjellum (MPISoft), ...
Advanced Research: Mr. Bob Graybill (DARPA)
• Partnership with ODUSD(S&T), government labs, FFRDCs, universities, contractors, vendors, and DoD programs
• Over 100 participants from over 20 organizations
-
Slide 10
Outline
• Introduction
• Software Standards
  – Standards Overview
  – Future Standards
• Parallel VSIPL++
• Future Challenges
• Summary
-
Slide 11
Emergence of Component Standards
[Diagram: a parallel embedded processor with nodes P0–P3 and a node controller, plus a system controller linked to other computers and consoles. Data communication: MPI, MPI/RT, DRI. Control communication: CORBA, HP-CORBA. Computation: VSIPL, VSIPL++, ||VSIPL++]
Definitions:
• VSIPL = Vector, Signal, and Image Processing Library
• ||VSIPL++ = Parallel Object-Oriented VSIPL
• MPI = Message Passing Interface
• MPI/RT = MPI Real-Time
• DRI = Data Reorganization Interface
• CORBA = Common Object Request Broker Architecture
• HP-CORBA = High Performance CORBA
The HPEC Initiative builds on completed research and on existing standards and libraries.
-
Slide 12
The Path to Parallel VSIPL++ (world’s first parallel object-oriented standard)
• First demo successfully completed
• VSIPL++ v0.5 spec completed
• VSIPL++ v0.1 code available
• Parallel VSIPL++ spec in progress
• High performance C++ demonstrated
[Chart: functionality vs. time across three phases]
• Phase 1: Demonstration of existing standards (VSIPL, MPI): demonstrate insertions into fielded systems (e.g., CIP) and 3x portability; Development of object-oriented standards (VSIPL++ prototype): high-level code abstraction, reduce code size 3x; Applied Research on a unified computation/communication library (Parallel VSIPL++ prototype): demonstrate scalability
• Phase 2: Demonstration of object-oriented standards (VSIPL++); Development of the unified comp/comm library; Applied Research on fault tolerance
• Phase 3: Demonstration of the unified comp/comm library (Parallel VSIPL++); Development of fault tolerance; Applied Research on self-optimization
-
Slide 13
Working Group Technical Scope
Development (VSIPL++):
– Mapping (data parallelism)
– Early binding (computations)
– Compatibility (backward/forward)
– Local knowledge (accessing local data)
– Extensibility (adding new functions)
– Remote procedure calls (CORBA)
– C++ compiler support
– Test suite
– Adoption incentives (vendor, integrator)

Applied Research (Parallel VSIPL++):
– Mapping (task/pipeline parallel)
– Reconfiguration (for fault tolerance)
– Threads
– Reliability/availability
– Data permutation (DRI functionality)
– Tools (profiles, timers, ...)
– Quality of service
-
Slide 14
Overall Technical Tasks and Schedule
Tasks (FY01–FY08, spanning near-, mid-, and long-term applied research, development, and demonstration):
• VSIPL (Vector, Signal, and Image Processing Library)
• MPI (Message Passing Interface)
• VSIPL++ (object oriented): v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
• Parallel VSIPL++: v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
• Fault-tolerant/self-optimizing software
[Gantt chart: demonstrations CIP and Demo 2 through Demo 6 across the schedule]
-
Slide 15
HPEC-SI Goals: 1st Demo Achievements
[Diagram: HPEC Software Initiative cycle (Demonstrate, Develop, Prototype; Object Oriented, Open Standards, Interoperable & Scalable) annotated with first-demo results]
• Portability: goal 3x, achieved 10x+ (zero code changes required)
• Productivity: goal 3x, achieved 6x* (DRI code 6x smaller vs. MPI, estimated*)
• Performance: goal 1.5x, achieved 2x (3x reduced cost or form factor)
-
Slide 16
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
  – Technical Basis
  – Examples
• Future Challenges
• Summary
-
Slide 17
Parallel Pipeline
[Diagram: a signal processing algorithm mapped onto a parallel computer]
Signal processing algorithm stages:
• Filter: XOUT = FIR(XIN)
• Beamform: XOUT = w * XIN
• Detect: XOUT = |XIN| > c
Mapping:
• Data parallel within stages
• Task/pipeline parallel across stages
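To make the per-stage math concrete, here is a minimal serial sketch in plain C++ (the signal values, filter taps, weight w, and threshold c are illustrative assumptions, not taken from the slides); a real system would additionally map each stage across processors as described above.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Filter stage: y[n] = sum_k h[k] * x[n-k]   (XOUT = FIR(XIN))
std::vector<double> fir(const std::vector<double>& x,
                        const std::vector<double>& h) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];
    return y;
}

int main() {
    std::vector<double> xin = {1, -2, 3, -4, 5};  // toy input signal
    std::vector<double> h = {0.5, 0.5};           // toy 2-tap filter
    const double w = 2.0;                         // toy beamform weight
    const double c = 3.0;                         // toy detection threshold

    std::vector<double> x = fir(xin, h);          // Filter:   XOUT = FIR(XIN)
    for (double& v : x) v *= w;                   // Beamform: XOUT = w * XIN
    for (double v : x)                            // Detect:   XOUT = |XIN| > c
        std::printf("%6.2f -> %s\n", v, std::fabs(v) > c ? "detect" : "no detect");
    return 0;
}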
-
Slide 18
Types of Parallelism
[Diagram: input feeds FIR filters (data parallel); a scheduler distributes work round-robin to beamformers 1 and 2 (task parallel), which feed detectors 1 and 2 in a pipeline]
-
Slide 19
Current Approach to Parallel Code
Code (algorithm + mapping):

while (!done) {
    if (rank() == 1 || rank() == 2)
        stage1();
    else if (rank() == 3 || rank() == 4)
        stage2();
}
[Diagram: Stage 1 runs on processors 1–2 and Stage 2 on processors 3–4; scaling Stage 2 out to processors 3–6 forces the code change below]
while (!done) {
    if (rank() == 1 || rank() == 2)
        stage1();
    else if (rank() == 3 || rank() == 4 ||
             rank() == 5 || rank() == 6)
        stage2();
}
• Algorithm and hardware mapping are linked
• Resulting code is non-scalable and non-portable
-
Slide 20
Scalable Approach
Single Processor Mapping

#include <pvl/Vector.h>  // hypothetical header names; the originals were elided in extraction
#include <pvl/Map.h>

void addVectors(Map& aMap, Map& bMap, Map& cMap) {  // Map parameter types assumed
    Vector< Complex > a('a', aMap, LENGTH);
    Vector< Complex > b('b', bMap, LENGTH);
    Vector< Complex > c('c', cMap, LENGTH);
    b = 1;
    c = 2;
    a = b + c;
}
[Diagram: the identical A = B + C code runs under both a single-processor map and a multi-processor map]
• Single-processor and multi-processor code are the same
• Maps can be changed without changing software
• High-level code is compact

Lincoln Parallel Vector Library (PVL)
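The map concept can be sketched in a few lines of self-contained C++. The Map and Vector classes below are toy stand-ins, not the actual PVL API: the point is that addVectors is written once against the map's partition, so moving from one processor to four changes only the map object. Here a single process simulates all nodes; PVL instead dispatches each block to a real processor.

#include <cstdio>
#include <initializer_list>
#include <vector>

// Toy Map: assigns a contiguous block of indices to each of `nodes` processors.
struct Map {
    int nodes;
    explicit Map(int n) : nodes(n) {}
};

// Toy distributed vector: the arithmetic below only ever touches data
// through the map's partition.
struct Vector {
    Map map;
    std::vector<double> data;
    Vector(Map m, int length, double init = 0.0) : map(m), data(length, init) {}
};

// a = b + c, written once; the map decides which node owns which block.
void addVectors(Vector& a, const Vector& b, const Vector& c) {
    const int n = static_cast<int>(a.data.size());
    for (int node = 0; node < a.map.nodes; ++node) {
        const int lo = node * n / a.map.nodes;
        const int hi = (node + 1) * n / a.map.nodes;
        for (int i = lo; i < hi; ++i)        // each node computes only its block
            a.data[i] = b.data[i] + c.data[i];
    }
}

int main() {
    const int LENGTH = 8;
    for (int procs : {1, 4}) {               // change the map, not the algorithm
        Map map(procs);
        Vector a(map, LENGTH), b(map, LENGTH, 1.0), c(map, LENGTH, 2.0);
        addVectors(a, b, c);
        std::printf("%d-processor map: a[0] = %g\n", procs, a.data[0]);
    }
    return 0;
}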
-
Slide 21
C++ Expression Templates and PETE
[Diagram: for A = B + C, main passes B and C references to operator+ (1), which creates an expression parse tree (2) and returns it (3); the tree reference is passed to operator= (4), which calculates the result and performs the assignment (5). For A = B + C * D, a parse tree and expression type are created rather than temporary vectors]
• Expression templates enhance performance by allowing temporary variables to be avoided
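The mechanism can be illustrated with a minimal expression-template sketch in the spirit of PETE (a simplified illustration, not PETE's actual classes): operator+ returns a lightweight parse-tree node, and operator= evaluates the whole tree in a single element-wise loop, so no temporary vectors are ever built.

#include <cstddef>
#include <cstdio>
#include <vector>

// Parse-tree node for lhs + rhs; elements are computed lazily on demand.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0.0) : d(n, v) {}
    double operator[](std::size_t i) const { return d[i]; }
    // Assignment walks the expression tree once; no temporary Vec is created.
    template <class Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
        return *this;
    }
};

// operator+ builds a parse-tree node instead of computing a result vector.
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec a(4), b(4, 1.0), c(4, 2.0), d(4, 3.0);
    a = b + c + d;                       // one fused loop: a[i] = b[i]+c[i]+d[i]
    std::printf("a[0] = %g\n", a.d[0]);  // prints 6
    return 0;
}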
-
Slide 22
PETE Linux Cluster Experiments
[Plots: relative execution time vs. vector length (32 to 131072) for the expressions A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE]
• PVL with VSIPL has a small overhead
• PVL with PETE can surpass VSIPL
-
Slide 23
PowerPC AltiVec Experiments
[Plots: execution results for A=B+C, A=B+C*D, A=B+C*D+E*F, and A=B+C*D+E/F under three software technologies]

Results:
• Hand-coded loop achieves good performance, but is problem-specific and low level
• Optimized VSIPL performs well for simple expressions, worse for more complex expressions
• PETE-style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high level

Software technologies compared:
• AltiVec loop: C; for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
• VSIPL (vendor optimized): C; AltiVec-aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment
• PETE with AltiVec: C++; PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment
-
Slide 24
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
  – Technical Basis
  – Examples
• Future Challenges
• Summary
-
Slide 25
A = sin(A) + 2 * B;
• Generated code (no temporaries)
for (index i = 0; i < A.size(); ++i)
    A.put(i, sin(A.get(i)) + 2 * B.get(i));
• Apply inlining to transform to
for (index i = 0; i < A.size(); ++i)
    Ablock[i] = sin(Ablock[i]) + 2 * Bblock[i];
• Apply more inlining to transform to
T* Bp = &(Bblock[0]);
T* Aend = &(Ablock[A.size()]);
for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
    *Ap = fmadd(2, *Bp, sin(*Ap));
• Or apply PowerPC AltiVec extensions
• Each step can be automatically generated
• Optimization level is whatever the vendor desires
-
Slide 26
BLAS zherk Routine
• BLAS = Basic Linear Algebra Subprograms
• Hermitian matrix M: conjug(M) = M^t
• zherk performs a rank-k update of Hermitian matrix C:
  C ← α ∗ A ∗ conjug(A)^t + β ∗ C
• VSIPL code:

A = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
C = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
vsip_cmprodh_d(A,A,tmp);        /* A*conjug(A)^t */
vsip_rscmmul_d(alpha,tmp,tmp);  /* α*A*conjug(A)^t */
vsip_rscmmul_d(beta,C,C);       /* β*C */
vsip_cmadd_d(tmp,C,C);          /* α*A*conjug(A)^t + β*C */
vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
vsip_cblockdestroy(vsip_cmdestroy_d(C));
vsip_cblockdestroy(vsip_cmdestroy_d(A));
• VSIPL++ code (also parallel):

Matrix A(10,15);
Matrix C(10,10);
C = alpha * prodh(A,A) + beta * C;
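As a cross-check on the math, the following plain-C++ reference sketch (a hypothetical helper, not part of BLAS or VSIPL) computes the same rank-k update C ← α·A·conjug(A)^t + β·C directly with std::complex.

#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

// Reference rank-k update: C <- alpha * A * conjug(A)^t + beta * C,
// with A stored row-major as m x k and C as m x m.
void zherk_ref(double alpha, const std::vector<cd>& A, int m, int k,
               double beta, std::vector<cd>& C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j) {
            cd sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * std::conj(A[j * k + p]);  // (A*conjug(A)^t)(i,j)
            C[i * m + j] = alpha * sum + beta * C[i * m + j];
        }
}

int main() {
    const int m = 2, k = 3;
    std::vector<cd> A = {{1, 1}, {2, 0}, {0, 1},    // row 0 (toy values)
                         {1, 0}, {0, 2}, {3, 1}};   // row 1
    std::vector<cd> C(m * m, cd(1.0, 0.0));
    zherk_ref(0.5, A, m, k, 2.0, C);
    std::printf("C(0,0) = %g%+gi\n", C[0].real(), C[0].imag());
    return 0;
}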
-
Slide 27
Simple Filtering Application
int main () {
    using namespace vsip;
    const length ROWS = 64;
    const length COLS = 4096;
    vsipl v;
    FFT forward_fft (Domain(ROWS,COLS), 1.0);
    FFT inverse_fft (Domain(ROWS,COLS), 1.0);
    const Matrix weights (load_weights (ROWS, COLS));
    try {
        while (1)
            output (inverse_fft (forward_fft (input ()) * weights));
    }
    catch (std::runtime_error) {
        // Successfully caught access outside domain.
    }
}
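The slide's kernel, output = inverse_fft(forward_fft(input) * weights), can be sketched self-contained in plain C++ using a naive DFT in place of the library FFT objects (all data values here are illustrative assumptions):

#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT; sign = -1 for forward, +1 for inverse (scaled by 1/n).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = x.size();
    std::vector<cd> y(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            y[k] += x[j] * std::polar(1.0, sign * 2.0 * pi * k * j / n);
    if (sign > 0)
        for (cd& v : y) v /= static_cast<double>(n);
    return y;
}

int main() {
    std::vector<cd> input = {1.0, 2.0, 3.0, 4.0};   // toy input frame
    std::vector<cd> weights = {1.0, 0.0, 0.0, 1.0}; // toy frequency weights
    std::vector<cd> spectrum = dft(input, -1);      // forward_fft(input)
    for (std::size_t i = 0; i < spectrum.size(); ++i)
        spectrum[i] *= weights[i];                  // ... * weights
    std::vector<cd> output = dft(spectrum, +1);     // inverse_fft(...)
    for (const cd& v : output)
        std::printf("%.2f%+.2fi\n", v.real(), v.imag());
    return 0;
}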
-
Slide 28
Explicit Parallel Filter
#include <vsiplpp.h>  // hypothetical header name; the original was elided in extraction
using namespace VSIPL;
const int ROWS = 64;
const int COLS = 4096;

int main (int argc, char **argv)
{
    Matrix W (ROWS, COLS, "WMap");  // weights matrix
    Matrix X (ROWS, COLS, "WMap");  // input matrix
    Matrix Y (ROWS, COLS, "WMap");  // output matrix (declaration assumed; not on the slide)
    load_weights (W);
    try
    {
        while (1)
        {
            input (X);                     // some input function
            Y = IFFT (mul (FFT (X), W));
            output (Y);                    // some output function
        }
    }
    catch (Exception &e)
    {
        cerr << e.what() << endl;          // handler body assumed; slide text truncated after "cerr"
    }
}
-
Slide 29
Multi-Stage Filter (main)
using namespace vsip;
const length ROWS = 64;
const length COLS = 4096;

int main (int argc, char **argv) {
    sample_low_pass_filter LPF;   // "LPF()" would declare a function, not an object
    sample_beamform BF;
    sample_matched_filter MF;
    try {
        while (1)
            output (MF (BF (LPF (input ()))));
    }
    catch (std::runtime_error) {
        // Successfully caught access outside domain.
    }
}
-
Slide 30
Multi-Stage Filter (low pass filter)
template <typename T>  // template parameter assumed; the original's was lost in extraction
class sample_low_pass_filter
{
public:
    sample_low_pass_filter()
        : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
          FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH)
    { }

    Matrix operator () (const Matrix& Input)
    {
        Matrix output(ROWS, COLS);
        for (index row = 0; row < ROWS; ++row)
            output.row(row) = FIR2_(FIR1_(Input.row(row)));  // loop body assumed; slide text truncated here
        return output;
    }

private:
    FIR FIR1_, FIR2_;  // member declarations assumed; slide text truncated
};
-
Slide 31
Multi-Stage Filter (beam former)
template <typename T>  // template parameter assumed; the original's was lost in extraction
class sample_beamform
{
public:
    sample_beamform() : W3_(load_w3 (ROWS, COLS)) { }
    Matrix operator () (const Matrix& Input) const { return W3_ * Input; }
private:
    const Matrix W3_;
};
-
Slide 32
Multi-Stage Filter (matched filter)
template <typename T>  // template parameter assumed; the original's was lost in extraction
class sample_matched_filter
{
public:
    sample_matched_filter()  // renamed from matched_filter() to match the class
        : W4_(load_w4 (ROWS, COLS)),
          forward_fft_ (Domain(ROWS, COLS), 1.0),
          inverse_fft_ (Domain(ROWS, COLS), 1.0)
    { }

    Matrix operator () (const Matrix& Input) const
    { return inverse_fft_ (forward_fft_ (Input) * W4_); }

private:
    const Matrix W4_;
    FFT forward_fft_;
    FFT inverse_fft_;
};
-
Slide 33
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
• Future Challenges
  – Fault Tolerance
  – Self Optimization
  – High Level Languages
• Summary
-
Slide 34
Dynamic Mapping for Fault Tolerance
[Diagram: a parallel processor runs an input task (Map0, nodes 0,1), a computation XIN → XOUT (Map1, nodes 1,3), and an output task; on failure, the computation is remapped via Map2 to nodes 0,2, using the spare node]
• Switching processors is accomplished by switching maps
• No change to the algorithm is required
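A toy C++ sketch of the idea (hypothetical classes, not the actual library API): the task's computation is bound to a Map of node IDs, and recovering from a failure rebinds the task to the spare map without touching the algorithm.

#include <cstdio>
#include <vector>

// Toy Map: the set of node IDs a task currently runs on.
struct Map {
    std::vector<int> nodes;
};

// Toy Task: its work is expressed against whatever map it holds.
struct Task {
    Map map;
    void run() const {
        for (int n : map.nodes)
            std::printf("computing on node %d\n", n);
    }
};

int main() {
    Map map1{{1, 3}};                    // primary assignment (nodes 1,3)
    Map map2{{0, 2}};                    // fallback using the spare (nodes 0,2)
    Task filter{map1};
    filter.run();                        // normal operation

    const bool node3_failed = true;      // simulated failure detection
    if (node3_failed)
        filter.map = map2;               // switch maps, not algorithms
    filter.run();                        // same task, now on nodes 0,2
    return 0;
}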
-
Slide 35
Dynamic Mapping Performance Results
[Plot: relative time (roughly 1x to 100x, log scale) vs. data size (32 to 2048) for Remap/Baseline, Redundant/Baseline, and Rebind/Baseline]
• Good dynamic mapping performance is possible
-
Slide 36
Optimal Mapping of Complex Algorithms
[Diagram: an application with input, a low-pass filter (XIN through FIR1 with weights W1 and FIR2 with weights W2 to XOUT), a beamformer (multiply by W3), and a matched filter (W4 with FFT/IFFT), mapped onto different hardware: workstation, embedded board, embedded multi-computer, PowerPC cluster, Intel cluster; each algorithm/hardware pairing has a different optimal map]
• Need to automate the process of mapping algorithms to hardware
-
Slide 37
Self-Optimizing Software for Signal Processing
• Find: Min(latency | #CPU) and Max(throughput | #CPU)
• S3P selects the correct optimal mapping
• Excellent agreement between S3P-predicted and achieved latencies and throughputs
[Plots: latency (seconds) and throughput (frames/sec) vs. #CPU (4 to 8) for large (48x128K) and small (48x4K) problem sizes, showing predicted and achieved curves annotated with stage mappings such as 1-1-1-1, 1-2-2-1, 1-2-2-2, and 2-2-2-2]
A simplified sketch of this kind of mapping search follows.
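The sketch below is a deliberately simplified stand-in for S3P (the timing model and stage workloads are assumptions): it enumerates every way to split N CPUs across the four pipeline stages and keeps the split whose slowest stage is fastest, i.e., the mapping with the best throughput.

#include <algorithm>
#include <cstdio>

// Hypothetical timing model: each stage's time shrinks linearly with CPUs.
double stage_time(int stage, int cpus) {
    static const double work[4] = {1.0, 4.0, 2.0, 3.0};  // assumed workloads
    return work[stage] / cpus;
}

int main() {
    const int N = 8;                      // total CPUs available
    double best = 1e9;
    int bestMap[4] = {0, 0, 0, 0};
    for (int a = 1; a < N; ++a)
        for (int b = 1; a + b < N; ++b)
            for (int c = 1; a + b + c < N; ++c) {
                const int m[4] = {a, b, c, N - a - b - c};
                // Pipeline throughput is limited by the slowest stage.
                double slowest = 0.0;
                for (int s = 0; s < 4; ++s)
                    slowest = std::max(slowest, stage_time(s, m[s]));
                if (slowest < best) {
                    best = slowest;
                    std::copy(m, m + 4, bestMap);
                }
            }
    std::printf("best map %d-%d-%d-%d, frame time %.3f\n",
                bestMap[0], bestMap[1], bestMap[2], bestMap[3], best);
    return 0;
}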
-
Slide 38
High Level Languages
[Diagram: a Parallel Matlab Toolbox sits between high-performance Matlab applications (DoD sensor processing, DoD mission planning, scientific simulation, commercial applications) and parallel computing hardware, providing both the user interface and the hardware interface]
• Parallel Matlab need has been identified: HPCMO (OSU)
• Required user interface has been demonstrated: Matlab*P (MIT/LCS), PVL (MIT/LL)
• Required hardware interface has been demonstrated: MatlabMPI (MIT/LL)
• The Parallel Matlab Toolbox can now be realized
-
Slide 39
MatlabMPI deployment (speedup)
• Maui
  – Image filtering benchmark (300x on 304 CPUs)
• Lincoln
  – Signal processing (7.8x on 8 CPUs)
  – Radar simulations (7.5x on 8 CPUs)
  – Hyperspectral (2.9x on 3 CPUs)
• MIT
  – LCS Beowulf (11x Gflops on 9 duals)
  – AI Lab face recognition (10x on 8 duals)
• Other
  – Ohio St. EM simulations
  – ARL SAR image enhancement
  – Wash U hearing aid simulations
  – So. Ill. benchmarking
  – JHU digital beamforming
  – ISL radar simulation
  – URI heart modeling
• Rapidly growing MatlabMPI user base demonstrates the need for parallel Matlab
• Demonstrated scaling to 300 processors
[Plot: image filtering on an IBM SP at the Maui Computing Center: performance (Gigaflops) vs. number of processors (1 to 1000), with MatlabMPI tracking linear speedup]
-
Slide 40
Summary
• HPEC-SI expected benefit
  – Open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs
• Invitation to participate
  – DoD program offices with signal/image processing needs
  – Academic and government researchers interested in high performance embedded computing
  – Contact: [email protected]
-
Slide 41
The Links
High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
Vector, Signal, and Image Processing Library: http://www.vsipl.org/
MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
Data Reorganization Initiative: http://www.data-re.org/
CodeSourcery, LLC: http://www.codesourcery.com/
MatlabMPI: http://www.ll.mit.edu/MatlabMPI