-
Slide 1
MITRE · AFRL · MIT Lincoln Laboratory · www.hpec-si.org
High Performance Embedded Computing Software Initiative (HPEC-SI)
This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Dr. Jeremy Kepner, MIT Lincoln Laboratory
-
Slide 2
Outline
• Introduction
  – DoD Need
  – Program Structure
• Software Standards
• Parallel VSIPL++
• Future Challenges
• Summary
-
Slide 3
Overview - High Performance Embedded Computing (HPEC) Initiative
[Figure: Common Imagery Processor (CIP) with Enhanced Tactical Radar Correlator (ETRAC) and ASARS-2 imagery sources, hosted on a shared-memory server and an embedded multi-processor]
[Diagram: the HPEC Software Initiative spans Applied Research (DARPA), Development, and Demonstration, feeding DoD acquisition programs]
Challenge: Transition advanced software technology and practices into major defense acquisition programs
-
Slide 4
Why Is DoD Concerned with Embedded Software?
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software, $0–$3B, from FY98 onward. Source: “HPEC Market Study,” March 2001]
• COTS acquisition practices have shifted the burden from “point design” hardware to “point design” software
• Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards
-
Slide 5
Issues with Current HPEC Development: Inadequacy of Software Practices & Standards
[Figure: representative DoD platforms relying on HPEC: NSSN, AEGIS, Rivet Joint, Standard Missile, Predator, Global Hawk, U-2, JSTARS, MSAT-Air, P-3/APS-137, F-16, MK-48 Torpedo]
Today, embedded software is:
• Not portable
• Not scalable
• Difficult to develop
• Expensive to maintain
[Timeline: system development/acquisition stages of roughly 4 years each: system technology development, system field demonstration, engineering/manufacturing development, insertion to military asset; over the same period the signal processor evolves through six generations]
• High Performance Embedded Computing is pervasive throughout DoD applications:
  – Airborne radar insertion program: 85% software rewrite for each hardware platform
  – Missile common processor: processor board costs < $100K; software development costs > $100M
  – Torpedo upgrade: two software rewrites required after changes in hardware design
-
Slide 6
Evolution of Software Support Towards “Write Once, Run Anywhere/Anysize”
[Diagram: evolution of DoD and COTS software development from 1990 (application tied directly to vendor software) through 2000 (application / middleware / vendor software layers) to 2005 (applications running on middleware over common embedded software standards)]
• Application software has traditionally been tied to the hardware
• Many acquisition programs are developing stove-piped middleware “standards”
• Open software standards can provide portability, performance, and productivity benefits
• Goal: support “Write Once, Run Anywhere/Anysize”
-
Slide 7
Overall Initiative Goals & Impact
[Diagram: HPEC Software Initiative cycle of Demonstrate, Develop, and Prototype, built on object-oriented technology, open standards, and interoperable & scalable designs; goals: Portability (3x), Productivity (3x), Performance (1.5x)]
Program Goals
• Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
• Engage acquisition community to promote technology insertion
• Deliver quantifiable benefits
Portability: reduction in lines-of-code changed to port/scale to a new system
Productivity: reduction in overall lines-of-code
Performance: computation and communication benchmarks
-
Slide 8
HPEC-SI Path to Success
Benefit to DoD Programs
• Reduces software cost & schedule
• Enables rapid COTS insertion
• Improves cross-program interoperability
• Basis for improved capabilities

Benefit to DoD Contractors
• Reduces software complexity & risk
• Easier comparisons/more competition
• Increased functionality

Benefit to Embedded Vendors
• Lower software barrier to entry
• Reduced software maintenance costs
• Evolution of open standards
The HPEC Software Initiative builds on:
• Proven technology
• Business models
• Better software practices
-
Slide 9
Organization
Demonstration: Dr. Keith Bromley (SPAWAR), Dr. Richard Games (MITRE), Dr. Jeremy Kepner (MIT/LL), Mr. Brian Sroka (MITRE), Mr. Ron Williams (MITRE), ...
Government Lead: Dr. Rich Linderman (AFRL)
Technical Advisory Board: Dr. Rich Linderman (AFRL), Dr. Richard Games (MITRE), Mr. John Grosh (OSD), Mr. Bob Graybill (DARPA/ITO), Dr. Keith Bromley (SPAWAR), Dr. Mark Richards (GTRI), Dr. Jeremy Kepner (MIT/LL)
Executive Committee: Dr. Charles Holland (PADUSD(S&T)), RADM Paul Sullivan (N77)
Development: Dr. James Lebak (MIT/LL), Dr. Mark Richards (GTRI), Mr. Dan Campbell (GTRI), Mr. Ken Cain (Mercury), Mr. Randy Judd (SPAWAR), ...
Applied Research: Mr. Bob Bond (MIT/LL), Mr. Ken Flowers (Mercury), Dr. Spaanenburg (PENTUM), Mr. Dennis Cottel (SPAWAR), Capt. Bergmann (AFRL), Dr. Tony Skjellum (MPISoft), ...
Advanced Research: Mr. Bob Graybill (DARPA)
• Partnership with ODUSD(S&T), government labs, FFRDCs, universities, contractors, vendors, and DoD programs
• Over 100 participants from over 20 organizations
-
Slide 10
Outline
• Introduction
• Software Standards
  – Standards Overview
  – Future Standards
• Parallel VSIPL++
• Future Challenges
• Summary
-
Slide 11
Emergence of Component Standards
[Diagram: a parallel embedded processor with nodes P0–P3 and a node controller, plus a system controller linked to other computers and consoles. Data communication: MPI, MPI/RT, DRI. Control communication: CORBA, HP-CORBA. Computation: VSIPL, VSIPL++, ||VSIPL++]
Definitions:
• VSIPL = Vector, Signal, and Image Processing Library
• ||VSIPL++ = Parallel Object-Oriented VSIPL
• MPI = Message Passing Interface
• MPI/RT = MPI Real-Time
• DRI = Data Reorganization Interface
• CORBA = Common Object Request Broker Architecture
• HP-CORBA = High Performance CORBA
The HPEC Initiative builds on completed research and on existing standards and libraries.
-
Slide 12
The Path to Parallel VSIPL++ (world’s first parallel object-oriented standard)
• First demo successfully completed
• VSIPL++ v0.5 spec completed
• VSIPL++ v0.1 code available
• Parallel VSIPL++ spec in progress
• High performance C++ demonstrated
[Chart: functionality vs. time across three phases]
• Phase 1: Demonstration of existing standards (VSIPL, MPI): demonstrate insertions into fielded systems (e.g., CIP) and 3x portability; Development of object-oriented standards (VSIPL++ prototype): high-level code abstraction, reduce code size 3x; Applied Research on a unified computation/communication library (Parallel VSIPL++ prototype): demonstrate scalability
• Phase 2: Demonstration of object-oriented standards (VSIPL++); Development of the unified comp/comm library; Applied Research on fault tolerance
• Phase 3: Demonstration of the unified comp/comm library (Parallel VSIPL++); Development of fault tolerance; Applied Research on self-optimization
-
Slide 13
Working Group Technical Scope
Development (VSIPL++):
– Mapping (data parallelism)
– Early binding (computations)
– Compatibility (backward/forward)
– Local knowledge (accessing local data)
– Extensibility (adding new functions)
– Remote procedure calls (CORBA)
– C++ compiler support
– Test suite
– Adoption incentives (vendor, integrator)

Applied Research (Parallel VSIPL++):
– Mapping (task/pipeline parallel)
– Reconfiguration (for fault tolerance)
– Threads
– Reliability/availability
– Data permutation (DRI functionality)
– Tools (profiles, timers, ...)
– Quality of service
-
Slide 14
Overall Technical Tasks and Schedule
Tasks (FY01–FY08, spanning near-, mid-, and long-term applied research, development, and demonstration):
• VSIPL (Vector, Signal, and Image Processing Library)
• MPI (Message Passing Interface)
• VSIPL++ (object oriented): v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
• Parallel VSIPL++: v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code
• Fault-tolerant/self-optimizing software
[Gantt chart: demonstrations CIP and Demo 2 through Demo 6 across the schedule]
-
Slide 15
HPEC-SI Goals: 1st Demo Achievements
[Diagram: HPEC Software Initiative cycle (Demonstrate, Develop, Prototype; Object Oriented, Open Standards, Interoperable & Scalable) annotated with first-demo results]
• Portability: goal 3x, achieved 10x+ (zero code changes required)
• Productivity: goal 3x, achieved 6x* (DRI code 6x smaller vs. MPI, estimated*)
• Performance: goal 1.5x, achieved 2x (3x reduced cost or form factor)
-
Slide 16
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
  – Technical Basis
  – Examples
• Future Challenges
• Summary
-
Slide 17
Parallel Pipeline
[Diagram: a signal processing algorithm mapped onto a parallel computer]
Signal processing algorithm stages:
• Filter: XOUT = FIR(XIN)
• Beamform: XOUT = w * XIN
• Detect: XOUT = |XIN| > c
Mapping:
• Data parallel within stages
• Task/pipeline parallel across stages
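To make the per-stage math concrete, here is a minimal serial sketch in plain C++ (the signal values, filter taps, weight w, and threshold c are illustrative assumptions, not taken from the slides); a real system would additionally map each stage across processors as described above.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Filter stage: y[n] = sum_k h[k] * x[n-k]   (XOUT = FIR(XIN))
std::vector<double> fir(const std::vector<double>& x,
                        const std::vector<double>& h) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];
    return y;
}

int main() {
    std::vector<double> xin = {1, -2, 3, -4, 5};  // toy input signal
    std::vector<double> h = {0.5, 0.5};           // toy 2-tap filter
    const double w = 2.0;                         // toy beamform weight
    const double c = 3.0;                         // toy detection threshold

    std::vector<double> x = fir(xin, h);          // Filter:   XOUT = FIR(XIN)
    for (double& v : x) v *= w;                   // Beamform: XOUT = w * XIN
    for (double v : x)                            // Detect:   XOUT = |XIN| > c
        std::printf("%6.2f -> %s\n", v, std::fabs(v) > c ? "detect" : "no detect");
    return 0;
}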
-
Slide 18
Types of Parallelism
[Diagram: input feeds FIR filters (data parallel); a scheduler distributes work round-robin to beamformers 1 and 2 (task parallel), which feed detectors 1 and 2 in a pipeline]
-
Slide 19
Current Approach to Parallel Code
Code (algorithm + mapping):

while (!done) {
    if (rank() == 1 || rank() == 2)
        stage1();
    else if (rank() == 3 || rank() == 4)
        stage2();
}
[Diagram: Stage 1 runs on processors 1–2 and Stage 2 on processors 3–4; scaling Stage 2 out to processors 3–6 forces the code change below]
while (!done) {
    if (rank() == 1 || rank() == 2)
        stage1();
    else if (rank() == 3 || rank() == 4 ||
             rank() == 5 || rank() == 6)
        stage2();
}
• Algorithm and hardware mapping are linked
• Resulting code is non-scalable and non-portable
-
Slide 20
Scalable Approach
Single Processor Mapping

#include <pvl/Vector.h>  // hypothetical header names; the originals were elided in extraction
#include <pvl/Map.h>

void addVectors(Map& aMap, Map& bMap, Map& cMap) {  // Map parameter types assumed
    Vector< Complex > a('a', aMap, LENGTH);
    Vector< Complex > b('b', bMap, LENGTH);
    Vector< Complex > c('c', cMap, LENGTH);
    b = 1;
    c = 2;
    a = b + c;
}
[Diagram: the identical A = B + C code runs under both a single-processor map and a multi-processor map]
• Single-processor and multi-processor code are the same
• Maps can be changed without changing software
• High-level code is compact

Lincoln Parallel Vector Library (PVL)
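The map concept can be sketched in a few lines of self-contained C++. The Map and Vector classes below are toy stand-ins, not the actual PVL API: the point is that addVectors is written once against the map's partition, so moving from one processor to four changes only the map object. Here a single process simulates all nodes; PVL instead dispatches each block to a real processor.

#include <cstdio>
#include <initializer_list>
#include <vector>

// Toy Map: assigns a contiguous block of indices to each of `nodes` processors.
struct Map {
    int nodes;
    explicit Map(int n) : nodes(n) {}
};

// Toy distributed vector: the arithmetic below only ever touches data
// through the map's partition.
struct Vector {
    Map map;
    std::vector<double> data;
    Vector(Map m, int length, double init = 0.0) : map(m), data(length, init) {}
};

// a = b + c, written once; the map decides which node owns which block.
void addVectors(Vector& a, const Vector& b, const Vector& c) {
    const int n = static_cast<int>(a.data.size());
    for (int node = 0; node < a.map.nodes; ++node) {
        const int lo = node * n / a.map.nodes;
        const int hi = (node + 1) * n / a.map.nodes;
        for (int i = lo; i < hi; ++i)        // each node computes only its block
            a.data[i] = b.data[i] + c.data[i];
    }
}

int main() {
    const int LENGTH = 8;
    for (int procs : {1, 4}) {               // change the map, not the algorithm
        Map map(procs);
        Vector a(map, LENGTH), b(map, LENGTH, 1.0), c(map, LENGTH, 2.0);
        addVectors(a, b, c);
        std::printf("%d-processor map: a[0] = %g\n", procs, a.data[0]);
    }
    return 0;
}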
-
Slide 21
C++ Expression Templates and PETE
[Diagram: for A = B + C, main passes B and C references to operator+ (1), which creates an expression parse tree (2) and returns it (3); the tree reference is passed to operator= (4), which calculates the result and performs the assignment (5). For A = B + C * D, a parse tree and expression type are created rather than temporary vectors]
• Expression templates enhance performance by allowing temporary variables to be avoided
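The mechanism can be illustrated with a minimal expression-template sketch in the spirit of PETE (a simplified illustration, not PETE's actual classes): operator+ returns a lightweight parse-tree node, and operator= evaluates the whole tree in a single element-wise loop, so no temporary vectors are ever built.

#include <cstddef>
#include <cstdio>
#include <vector>

// Parse-tree node for lhs + rhs; elements are computed lazily on demand.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0.0) : d(n, v) {}
    double operator[](std::size_t i) const { return d[i]; }
    // Assignment walks the expression tree once; no temporary Vec is created.
    template <class Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
        return *this;
    }
};

// operator+ builds a parse-tree node instead of computing a result vector.
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec a(4), b(4, 1.0), c(4, 2.0), d(4, 3.0);
    a = b + c + d;                       // one fused loop: a[i] = b[i]+c[i]+d[i]
    std::printf("a[0] = %g\n", a.d[0]);  // prints 6
    return 0;
}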
-
Slide 22
PETE Linux Cluster Experiments
[Plots: relative execution time vs. vector length (32 to 131072) for the expressions A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE]
• PVL with VSIPL has a small overhead
• PVL with PETE can surpass VSIPL
-
Slide 23
PowerPC AltiVec Experiments
[Plots: execution results for A=B+C, A=B+C*D, A=B+C*D+E*F, and A=B+C*D+E/F under three software technologies]

Results:
• Hand-coded loop achieves good performance, but is problem-specific and low level
• Optimized VSIPL performs well for simple expressions, worse for more complex expressions
• PETE-style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high level

Software technologies compared:
• AltiVec loop: C; for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
• VSIPL (vendor optimized): C; AltiVec-aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment
• PETE with AltiVec: C++; PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment
-
Slide 24
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
  – Technical Basis
  – Examples
• Future Challenges
• Summary
-
Slide 25
A = sin(A) + 2 * B;
• Generated code (no temporaries)
for (index i = 0; i < A.size(); ++i)
    A.put(i, sin(A.get(i)) + 2 * B.get(i));
• Apply inlining to transform to
for (index i = 0; i < A.size(); ++i)
    Ablock[i] = sin(Ablock[i]) + 2 * Bblock[i];
• Apply more inlining to transform to
T* Bp = &(Bblock[0]);
T* Aend = &(Ablock[A.size()]);
for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
    *Ap = fmadd(2, *Bp, sin(*Ap));
• Or apply PowerPC AltiVec extensions
• Each step can be automatically generated
• Optimization level is whatever the vendor desires
-
Slide 26
BLAS zherk Routine
• BLAS = Basic Linear Algebra Subprograms
• Hermitian matrix M: conjug(M) = M^t
• zherk performs a rank-k update of Hermitian matrix C:
  C ← α ∗ A ∗ conjug(A)^t + β ∗ C
• VSIPL code:

A = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
C = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
vsip_cmprodh_d(A,A,tmp);        /* A*conjug(A)^t */
vsip_rscmmul_d(alpha,tmp,tmp);  /* α*A*conjug(A)^t */
vsip_rscmmul_d(beta,C,C);       /* β*C */
vsip_cmadd_d(tmp,C,C);          /* α*A*conjug(A)^t + β*C */
vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
vsip_cblockdestroy(vsip_cmdestroy_d(C));
vsip_cblockdestroy(vsip_cmdestroy_d(A));
• VSIPL++ code (also parallel):

Matrix A(10,15);
Matrix C(10,10);
C = alpha * prodh(A,A) + beta * C;
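As a cross-check on the math, the following plain-C++ reference sketch (a hypothetical helper, not part of BLAS or VSIPL) computes the same rank-k update C ← α·A·conjug(A)^t + β·C directly with std::complex.

#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

// Reference rank-k update: C <- alpha * A * conjug(A)^t + beta * C,
// with A stored row-major as m x k and C as m x m.
void zherk_ref(double alpha, const std::vector<cd>& A, int m, int k,
               double beta, std::vector<cd>& C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j) {
            cd sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * std::conj(A[j * k + p]);  // (A*conjug(A)^t)(i,j)
            C[i * m + j] = alpha * sum + beta * C[i * m + j];
        }
}

int main() {
    const int m = 2, k = 3;
    std::vector<cd> A = {{1, 1}, {2, 0}, {0, 1},    // row 0 (toy values)
                         {1, 0}, {0, 2}, {3, 1}};   // row 1
    std::vector<cd> C(m * m, cd(1.0, 0.0));
    zherk_ref(0.5, A, m, k, 2.0, C);
    std::printf("C(0,0) = %g%+gi\n", C[0].real(), C[0].imag());
    return 0;
}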
-
Slide 27
Simple Filtering Application
int main () {
    using namespace vsip;
    const length ROWS = 64;
    const length COLS = 4096;
    vsipl v;
    FFT forward_fft (Domain(ROWS,COLS), 1.0);
    FFT inverse_fft (Domain(ROWS,COLS), 1.0);
    const Matrix weights (load_weights (ROWS, COLS));
    try {
        while (1)
            output (inverse_fft (forward_fft (input ()) * weights));
    }
    catch (std::runtime_error) {
        // Successfully caught access outside domain.
    }
}
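The slide's kernel, output = inverse_fft(forward_fft(input) * weights), can be sketched self-contained in plain C++ using a naive DFT in place of the library FFT objects (all data values here are illustrative assumptions):

#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT; sign = -1 for forward, +1 for inverse (scaled by 1/n).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = x.size();
    std::vector<cd> y(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            y[k] += x[j] * std::polar(1.0, sign * 2.0 * pi * k * j / n);
    if (sign > 0)
        for (cd& v : y) v /= static_cast<double>(n);
    return y;
}

int main() {
    std::vector<cd> input = {1.0, 2.0, 3.0, 4.0};   // toy input frame
    std::vector<cd> weights = {1.0, 0.0, 0.0, 1.0}; // toy frequency weights
    std::vector<cd> spectrum = dft(input, -1);      // forward_fft(input)
    for (std::size_t i = 0; i < spectrum.size(); ++i)
        spectrum[i] *= weights[i];                  // ... * weights
    std::vector<cd> output = dft(spectrum, +1);     // inverse_fft(...)
    for (const cd& v : output)
        std::printf("%.2f%+.2fi\n", v.real(), v.imag());
    return 0;
}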
-
Slide 28
Explicit Parallel Filter
#include <vsiplpp.h>  // hypothetical header name; the original was elided in extraction
using namespace VSIPL;
const int ROWS = 64;
const int COLS = 4096;

int main (int argc, char **argv)
{
    Matrix W (ROWS, COLS, "WMap");  // weights matrix
    Matrix X (ROWS, COLS, "WMap");  // input matrix
    Matrix Y (ROWS, COLS, "WMap");  // output matrix (declaration assumed; not on the slide)
    load_weights (W);
    try
    {
        while (1)
        {
            input (X);                     // some input function
            Y = IFFT (mul (FFT (X), W));
            output (Y);                    // some output function
        }
    }
    catch (Exception &e)
    {
        cerr << e.what() << endl;          // handler body assumed; slide text truncated after "cerr"
    }
}
-
Slide 29
Multi-Stage Filter (main)
using namespace vsip;
const length ROWS = 64;
const length COLS = 4096;

int main (int argc, char **argv) {
    sample_low_pass_filter LPF;   // "LPF()" would declare a function, not an object
    sample_beamform BF;
    sample_matched_filter MF;
    try {
        while (1)
            output (MF (BF (LPF (input ()))));
    }
    catch (std::runtime_error) {
        // Successfully caught access outside domain.
    }
}
-
Slide 30
Multi-Stage Filter (low pass filter)
template <typename T>  // template parameter assumed; the original's was lost in extraction
class sample_low_pass_filter
{
public:
    sample_low_pass_filter()
        : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
          FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH)
    { }

    Matrix operator () (const Matrix& Input)
    {
        Matrix output(ROWS, COLS);
        for (index row = 0; row < ROWS; ++row)
            output.row(row) = FIR2_(FIR1_(Input.row(row)));  // loop body assumed; slide text truncated here
        return output;
    }

private:
    FIR FIR1_, FIR2_;  // member declarations assumed; slide text truncated
};
-
Slide 31
Multi-Stage Filter (beam former)
template <typename T>  // template parameter assumed; the original's was lost in extraction
class sample_beamform
{
public:
    sample_beamform() : W3_(load_w3 (ROWS, COLS)) { }
    Matrix operator () (const Matrix& Input) const { return W3_ * Input; }
private:
    const Matrix W3_;
};
-
Slide 32
Multi-Stage Filter (matched filter)
template <typename T>  // template parameter assumed; the original's was lost in extraction
class sample_matched_filter
{
public:
    sample_matched_filter()  // renamed from matched_filter() to match the class
        : W4_(load_w4 (ROWS, COLS)),
          forward_fft_ (Domain(ROWS, COLS), 1.0),
          inverse_fft_ (Domain(ROWS, COLS), 1.0)
    { }

    Matrix operator () (const Matrix& Input) const
    { return inverse_fft_ (forward_fft_ (Input) * W4_); }

private:
    const Matrix W4_;
    FFT forward_fft_;
    FFT inverse_fft_;
};
-
Slide 33
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
• Future Challenges
  – Fault Tolerance
  – Self Optimization
  – High Level Languages
• Summary
-
Slide 34
Dynamic Mapping for Fault Tolerance
[Diagram: a parallel processor runs an input task (Map0, nodes 0,1), a computation XIN → XOUT (Map1, nodes 1,3), and an output task; on failure, the computation is remapped via Map2 to nodes 0,2, using the spare node]
• Switching processors is accomplished by switching maps
• No change to the algorithm is required
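A toy C++ sketch of the idea (hypothetical classes, not the actual library API): the task's computation is bound to a Map of node IDs, and recovering from a failure rebinds the task to the spare map without touching the algorithm.

#include <cstdio>
#include <vector>

// Toy Map: the set of node IDs a task currently runs on.
struct Map {
    std::vector<int> nodes;
};

// Toy Task: its work is expressed against whatever map it holds.
struct Task {
    Map map;
    void run() const {
        for (int n : map.nodes)
            std::printf("computing on node %d\n", n);
    }
};

int main() {
    Map map1{{1, 3}};                    // primary assignment (nodes 1,3)
    Map map2{{0, 2}};                    // fallback using the spare (nodes 0,2)
    Task filter{map1};
    filter.run();                        // normal operation

    const bool node3_failed = true;      // simulated failure detection
    if (node3_failed)
        filter.map = map2;               // switch maps, not algorithms
    filter.run();                        // same task, now on nodes 0,2
    return 0;
}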
-
Slide 35
Dynamic Mapping Performance Results
[Plot: relative time (roughly 1x to 100x, log scale) vs. data size (32 to 2048) for Remap/Baseline, Redundant/Baseline, and Rebind/Baseline]
• Good dynamic mapping performance is possible
-
Slide 36
Optimal Mapping of Complex Algorithms
[Diagram: an application with input, a low-pass filter (XIN through FIR1 with weights W1 and FIR2 with weights W2 to XOUT), a beamformer (multiply by W3), and a matched filter (W4 with FFT/IFFT), mapped onto different hardware: workstation, embedded board, embedded multi-computer, PowerPC cluster, Intel cluster; each algorithm/hardware pairing has a different optimal map]
• Need to automate the process of mapping algorithms to hardware
-
Slide 37
Self-Optimizing Software for Signal Processing
• Find: Min(latency | #CPU) and Max(throughput | #CPU)
• S3P selects the correct optimal mapping
• Excellent agreement between S3P-predicted and achieved latencies and throughputs
[Plots: latency (seconds) and throughput (frames/sec) vs. #CPU (4 to 8) for large (48x128K) and small (48x4K) problem sizes, showing predicted and achieved curves annotated with stage mappings such as 1-1-1-1, 1-2-2-1, 1-2-2-2, and 2-2-2-2]
A simplified sketch of this kind of mapping search follows.
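The sketch below is a deliberately simplified stand-in for S3P (the timing model and stage workloads are assumptions): it enumerates every way to split N CPUs across the four pipeline stages and keeps the split whose slowest stage is fastest, i.e., the mapping with the best throughput.

#include <algorithm>
#include <cstdio>

// Hypothetical timing model: each stage's time shrinks linearly with CPUs.
double stage_time(int stage, int cpus) {
    static const double work[4] = {1.0, 4.0, 2.0, 3.0};  // assumed workloads
    return work[stage] / cpus;
}

int main() {
    const int N = 8;                      // total CPUs available
    double best = 1e9;
    int bestMap[4] = {0, 0, 0, 0};
    for (int a = 1; a < N; ++a)
        for (int b = 1; a + b < N; ++b)
            for (int c = 1; a + b + c < N; ++c) {
                const int m[4] = {a, b, c, N - a - b - c};
                // Pipeline throughput is limited by the slowest stage.
                double slowest = 0.0;
                for (int s = 0; s < 4; ++s)
                    slowest = std::max(slowest, stage_time(s, m[s]));
                if (slowest < best) {
                    best = slowest;
                    std::copy(m, m + 4, bestMap);
                }
            }
    std::printf("best map %d-%d-%d-%d, frame time %.3f\n",
                bestMap[0], bestMap[1], bestMap[2], bestMap[3], best);
    return 0;
}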
-
Slide 38
High Level Languages
[Diagram: a Parallel Matlab Toolbox sits between high-performance Matlab applications (DoD sensor processing, DoD mission planning, scientific simulation, commercial applications) and parallel computing hardware, providing both the user interface and the hardware interface]
• Parallel Matlab need has been identified: HPCMO (OSU)
• Required user interface has been demonstrated: Matlab*P (MIT/LCS), PVL (MIT/LL)
• Required hardware interface has been demonstrated: MatlabMPI (MIT/LL)
• The Parallel Matlab Toolbox can now be realized
-
Slide 39
MatlabMPI deployment (speedup)
• Maui
  – Image filtering benchmark (300x on 304 CPUs)
• Lincoln
  – Signal processing (7.8x on 8 CPUs)
  – Radar simulations (7.5x on 8 CPUs)
  – Hyperspectral (2.9x on 3 CPUs)
• MIT
  – LCS Beowulf (11x Gflops on 9 duals)
  – AI Lab face recognition (10x on 8 duals)
• Other
  – Ohio St. EM simulations
  – ARL SAR image enhancement
  – Wash U hearing aid simulations
  – So. Ill. benchmarking
  – JHU digital beamforming
  – ISL radar simulation
  – URI heart modeling
• Rapidly growing MatlabMPI user base demonstrates the need for parallel Matlab
• Demonstrated scaling to 300 processors
[Plot: image filtering on an IBM SP at the Maui Computing Center: performance (Gigaflops) vs. number of processors (1 to 1000), with MatlabMPI tracking linear speedup]
-
Slide 40
Summary
• HPEC-SI expected benefit
  – Open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs
• Invitation to participate
  – DoD program offices with signal/image processing needs
  – Academic and government researchers interested in high performance embedded computing
  – Contact: [email protected]
-
Slide 41
The Links
High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
Vector, Signal, and Image Processing Library: http://www.vsipl.org/
MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
Data Reorganization Initiative: http://www.data-re.org/
CodeSourcery, LLC: http://www.codesourcery.com/
MatlabMPI: http://www.ll.mit.edu/MatlabMPI