Post on 06-Jan-2016
High Performance Computing on the Grid: Is It for You?
With a Discussion of Help on the Way (the GrADS Project)
Ken Kennedy
Center for High Performance Software
Rice University
http://www.cs.rice.edu/~ken/Presentations/GridForYou.pdf
GrADS Principal Investigators
Francine Berman, UCSD
Andrew Chien, UCSD
Keith Cooper, Rice
Jack Dongarra, Tennessee
Ian Foster, Chicago
Dennis Gannon, Indiana
Lennart Johnsson, Houston
Ken Kennedy, Rice
Carl Kesselman, USC ISI
John Mellor-Crummey, Rice
Dan Reed, UIUC
Linda Torczon, Rice
Rich Wolski, UCSB
GrADS Contributors
Dave Angulo, Chicago
Henri Casanova, UCSD
Holly Dail, UCSD
Anshu Dasgupta, Rice
Sridhar Gullapalli, USC ISI
Charles Koelbel, Rice
Anirban Mandal, Rice
Gabriel Marin, Rice
Mark Mazina, Rice
Celso Mendes, UIUC
Otto Sievert, UCSD
Martin Swany, UCSB
Satish Vadhiyar, Tennessee
Shannon Whitmore, UIUC
Asim Yarkan, Tennessee
National Distributed Problem Solving
[Figure: network of supercomputers and databases]
Today: Globus
• Developed by Ian Foster and Carl Kesselman
  —Grew from the I-Way (SC-95)
• Basic services for distributed computing
  —Resource discovery and information services
  —User authentication and access control
  —Job initiation
  —Communication services (Nexus and MPI)
• Applications are programmed by hand
  —Many applications
  —User responsible for resource mapping and all communication
    – Existing users acknowledge how hard this is
Today: Condor
• Support for matching application requirements to resources
  —User and resource provider write ClassAd specifications
  —System matches ClassAds for applications with ClassAds for resources
    – Selects the “best” match based on a user-specified priority
  —Can extend to Grid via Globus (Condor-G)
• What is missing?
  —User must handle application mapping tasks
  —No dynamic resource selection
  —No checkpoint/migration (resource re-selection)
  —Performance matching is straightforward
    – Priorities coded into ClassAds
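The matchmaking idea on this slide can be sketched in a few lines. This is a toy illustration only: it does not use the Condor ClassAd language or any Condor API, and every attribute name (`arch`, `memory_mb`, `mips`, `owner`) is a made-up assumption. Each side carries a requirements predicate, and the job carries a rank function that plays the role of the user-specified priority.

```python
# Toy matchmaking in the spirit of ClassAds; NOT the Condor ClassAd language.
# All attribute names below are hypothetical.

def matchmake(job, machines):
    """Return the machine accepted by both sides with the highest job rank,
    or None when no machine is acceptable."""
    candidates = [
        m for m in machines
        if job["requirements"](m) and m["requirements"](job)
    ]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])

job = {
    "owner": "alice",
    "requirements": lambda m: m["arch"] == "x86" and m["memory_mb"] >= 256,
    "rank": lambda m: m["mips"],  # user-specified priority: prefer faster machines
}

machines = [
    {"name": "a", "arch": "x86", "memory_mb": 512, "mips": 400,
     "requirements": lambda j: True},
    {"name": "b", "arch": "x86", "memory_mb": 1024, "mips": 900,
     "requirements": lambda j: j["owner"] != "alice"},  # provider policy
    {"name": "c", "arch": "alpha", "memory_mb": 2048, "mips": 1200,
     "requirements": lambda j: True},
    {"name": "d", "arch": "x86", "memory_mb": 256, "mips": 600,
     "requirements": lambda j: True},
]

best = matchmake(job, machines)
print(best["name"])
```

Note that the rank is a static priority written by the user; nothing here models how the application would actually perform on the chosen machine, which is the gap the slide points out.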
Today: DAGMan
• Support for scheduling jobs connected by partial orders
  —Overall application represented by a DAG
    – Vertices are jobs with file input and output
    – Edges are dependencies requiring a file transfer
  —Scheduling of individual jobs by Condor-G
    – Whenever a job has all its inputs computed, standard matchmaking is used to handle scheduling
• What is missing?
  —No support for co-scheduling
    – MPI jobs cannot be handled
    – Pipelined communication not exploited
  —No use of performance models to match jobs to resources
  —No iterative schedule
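The scheduling rule described above (a job becomes eligible once all of its inputs have been computed) can be sketched as follows. This is a minimal illustration and not DAGMan's input format or implementation; `run_dag`, `deps`, and the job names are hypothetical, and `run_job` stands in for handing a job to the matchmaker.

```python
from collections import deque

def run_dag(deps, run_job):
    """deps maps each job to the set of jobs it depends on (every parent must
    itself be a key). A job is released as soon as all parents have finished.
    Returns the order in which jobs were released."""
    remaining = {job: set(parents) for job, parents in deps.items()}
    children = {job: [] for job in deps}
    for job, parents in remaining.items():
        for parent in parents:
            children[parent].append(job)

    ready = deque(sorted(j for j, ps in remaining.items() if not ps))
    order = []
    while ready:
        job = ready.popleft()
        run_job(job)  # placeholder for "submit to matchmaking"
        order.append(job)
        for child in sorted(children[job]):
            remaining[child].discard(job)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency cycle: the input is not a DAG")
    return order

deps = {
    "prep": set(),
    "simA": {"prep"},
    "simB": {"prep"},
    "merge": {"simA", "simB"},
}
order = run_dag(deps, run_job=lambda job: None)
print(order)
```

Each job is released independently, so two jobs that must run at the same time (for example, the ranks of an MPI program) cannot be expressed, which matches the co-scheduling limitation listed on the slide.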
Applications that Work on the Grid
• Parameter sweep applications
  —SETI@home
  —Master server plus many satellite computations
    – No communication, reliability handled via replication
• Workflow applications
  —Many different applications that pass data using files
  —Use Condor’s DAGMan to handle scheduling
    – File staging is an issue
• Grid services applications
  —Functions invoked on the Grid via remote procedure call
    – Functionality fixed to specific resources
    – Fixed functionality required (database or telescope)
Future Applications
• Multidisciplinary applications
  —Example: Aeroelasticity
    – Fluid flow and structures on different clusters, with optimization in control
• Large-granularity MPI programs
  —Load matching is the main issue
• Programs with a mixture of computation and fixed resources
  —Computation with database access
    – Example: sequence matching (FASTA and Smith-Waterman)
• Scripts for a function machine
  —Functions replicated at various grid locations
  —Scripts in Matlab or other high-level language
GrADS Vision
• Build a National Problem-Solving System on the Grid
  —Transparent to the user, who sees a problem-solving system
• Software Support for Application Development on Grids
  —Goal: Design and build programming systems for the Grid that broaden the community of users who can develop and run applications in this complex environment
• Challenges:
  —Presenting a high-level application development interface
    – If programming is hard, the Grid will not reach its potential
  —Designing and constructing applications for adaptability
  —Late mapping of applications to Grid resources
  —Monitoring and control of performance
    – When should the application be interrupted and remapped?
GrADS Strategy
• Goal: Reduce work of preparing an application for Grid execution
  —Provide generic versions of key components currently built into applications
    – E.g., scheduling, application launch, performance monitoring
• Key Issue: What is in the application and what is in the system?
  —GrADS: Application = configurable object program
    – Code, mapper, and performance modeler
GrADSoft Architecture
[Figure: GrADSoft architecture diagram. Components: Source Application, Whole-Program Compiler, Libraries, Software Components, Configurable Object Program, Binder, Scheduler, Resource Negotiator, Grid Runtime System, Real-time Performance Monitor. Arrow labels: Performance Feedback, Performance Problem, Negotiation.]
Configurable Object Program
• Goal: Provide minimum needed to automate resource selection and program launch
• Code
  —Today: MPI program
  —Tomorrow: more general representations
• Mapper
  —Defines required resources and affinities to specialized resources
  —Given a set of resources, maps computation to those resources
    – “Optimal” performance, given all requirements met
• Performance Model
  —Given a set of resources and mapping, estimates performance
  —Serves as objective function for Resource Negotiator/Scheduler
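One way to picture the three parts of a configurable object program is as a small record holding the code, a mapper, and a performance model, with the performance model used as the objective function when choosing among candidate resource sets. The sketch below is an assumption-laden illustration, not the GrADS interface: all class, function, and field names are invented here, and the cost model (compute time plus a fixed per-processor overhead) is a deliberately crude stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass(frozen=True)
class Resource:
    name: str
    mflops: float     # hypothetical sustained rate of one processor
    memory_mb: int

Mapping = Dict[str, float]  # resource name -> fraction of the work

@dataclass
class ConfigurableObjectProgram:
    code: str  # e.g. path to an MPI executable (placeholder)
    mapper: Callable[[Sequence[Resource]], Mapping]
    performance_model: Callable[[Sequence[Resource], Mapping], float]

def proportional_mapper(resources):
    """Give each processor work in proportion to its speed."""
    total = sum(r.mflops for r in resources)
    return {r.name: r.mflops / total for r in resources}

def make_model(total_mflop, per_proc_overhead_s):
    """Estimated seconds: slowest processor's compute time plus an
    overhead that grows with the number of processors (an assumption)."""
    def model(resources, mapping):
        compute = max(total_mflop * mapping[r.name] / r.mflops for r in resources)
        return compute + per_proc_overhead_s * len(resources)
    return model

def select_resources(cop, candidates):
    """Pick the candidate set with the lowest estimated execution time."""
    return min(candidates,
               key=lambda rs: cop.performance_model(rs, cop.mapper(rs)))

cop = ConfigurableObjectProgram("a.out", proportional_mapper,
                                make_model(total_mflop=1e6, per_proc_overhead_s=100.0))
fast = [Resource("f1", 500, 512), Resource("f2", 500, 512)]
big = fast + [Resource("s1", 100, 128), Resource("s2", 100, 128)]
chosen = select_resources(cop, [fast, big])
print([r.name for r in chosen])
```

With these made-up numbers the two fast processors are estimated at 1200 s and the four-processor set at about 1233 s, so the smaller set wins; the point is only that the selection is driven by the application's own model rather than by a static priority.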
GrADSoft Architecture
[Figure: GrADSoft architecture diagram, repeated with the Execution Environment labeled]
Execution Cycle
• Configurable Object Program is presented
  —Space of feasible resources must be defined
  —Mapping strategy and performance model provided
• Resource Negotiator solicits acceptable resource collections
  —Performance model is used to evaluate each
  —Best match is selected and contracted for
• Execution begins
  —Binder tailors program to resources
    – Carries out final mapping according to mapping strategy
    – Inserts sensors and actuators for performance monitoring
• Contract monitoring is performed continuously during execution
  —Soft violation detection based on fuzzy logic
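The slide does not say how the fuzzy-logic violation detector works, so the following is only a guess at the general shape: instead of a hard pass/fail threshold, each measurement gets a degree of violation between 0 and 1, and a remap is signalled when the recent average degree is high. The function names, the ramp between `tolerable` and `severe`, the window, and the threshold are all assumptions made for illustration.

```python
def violation_degree(ratio, tolerable=1.2, severe=2.0):
    """Membership in 'contract violated' for ratio = measured / predicted time.
    0.0 up to `tolerable`, 1.0 from `severe`, linear in between."""
    if ratio <= tolerable:
        return 0.0
    if ratio >= severe:
        return 1.0
    return (ratio - tolerable) / (severe - tolerable)

def should_remap(predicted, measured, threshold=0.5, window=3):
    """Average the violation degree over the last `window` phases and
    signal a remap when it exceeds `threshold`. Inputs must be non-empty."""
    recent = list(zip(predicted, measured))[-window:]
    degrees = [violation_degree(m / p) for p, m in recent]
    return sum(degrees) / len(degrees) > threshold

predicted = [10.0, 10.0, 10.0, 10.0]
print(should_remap(predicted, [10.0, 13.0, 19.0, 22.0]))  # sustained slowdown
print(should_remap(predicted, [10.0, 10.0, 25.0, 10.0]))  # one transient spike
```

A sustained slowdown accumulates enough violation to trigger a remap, while a single spike does not, which is the practical appeal of a soft criterion when Grid load fluctuates.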
GrADS Program Execution System
[Figure: GrADS program execution system. Components: Configurable Application, Perf Model, Mapper, Application Manager (one per app), Scheduler/Resource Negotiator, Binder (mapping, sensor insertion), Launch, Contract Monitor, GrADS Information Repository, Grid Resources and Services.]
GrADSoft Architecture
[Figure: GrADSoft architecture diagram, repeated with the Program Preparation System labeled]
Program Preparation Tools
• Goal: provide tools to support the construction of Grid-ready applications (in the GrADS framework)
• Performance modeling
  —Challenge: synthesis and integration of performance models
    – Combine expert knowledge, trial execution, and scaled projections
  —Focus on binary analysis, derivation of scaling factors
• Mapping
  —Construction of mappers from parallel programs
    – Mapping of task graphs to resources (graph clustering)
  —Integration of mappers and performance modelers from components
• High-level programming interfaces
  —Problem-solving systems: integration of components
Generation of Mappers
• Start from parallel program
  —Typically written using a communication library (e.g., MPI)
  —Can be composed from library components
• Construct a task graph
  —Vertices represent tasks
  —Edges represent data sharing
    – Read-read: undirected edges
    – Read-write in any order: directed edges (dependences)
    – Weights represent volume of communication
  —Identify opportunities for pipelining
• Use a clustering algorithm to match tasks to resources
  —One option: global weighted fusion
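To make the clustering step concrete, here is a simple greedy heuristic: repeatedly fuse the two clusters joined by the heaviest remaining edge until the number of clusters equals the number of resources, so that the largest communication volumes stay inside a cluster. This is a swapped-in stand-in chosen for brevity; it is not claimed to be the global weighted fusion algorithm the slide mentions, it ignores edge direction and load balance, and the task names and weights are invented.

```python
def cluster_tasks(tasks, edges, num_resources):
    """edges is a list of (weight, u, v). Fuse clusters along the heaviest
    edges first until only num_resources clusters remain (union-find)."""
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    clusters = len(tasks)
    for weight, u, v in sorted(edges, key=lambda e: -e[0]):
        if clusters <= num_resources:
            break
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u
            clusters -= 1

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())

def cut_weight(clustering, edges):
    """Total communication volume that crosses cluster boundaries."""
    home = {t: i for i, group in enumerate(clustering) for t in group}
    return sum(w for w, u, v in edges if home[u] != home[v])

tasks = ["a", "b", "c", "d", "e"]
edges = [(100, "a", "b"), (5, "b", "c"), (80, "c", "d"),
         (60, "d", "e"), (1, "a", "e")]
clustering = cluster_tasks(tasks, edges, num_resources=2)
print(clustering, cut_weight(clustering, edges))
```

In this example only the two light edges (weights 5 and 1) end up crossing between the two resources.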
Constructing Scalable, Portable Models
Construct Application Signatures
• Measure static characteristics
• Measure dynamic characteristics for multiple executions
  —computation
  —memory access locality
  —message frequency and size
• Determine sensitivity of aggregate dynamic characteristics to
  —data size
  —processor count
  —machine characteristics
• Build the model via integration
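As one small illustration of determining sensitivity to data size from multiple executions, a power law t = c * n^k can be fitted to measured run times by least squares in log-log space; the exponent k is then the sensitivity, and the fitted model can be used for scaled projections. This is a generic technique, not the GrADS modeling tool, and the timings below are synthetic values generated from a cubic law purely so the example is checkable.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(c) + k * log(n). Returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k

def predict(c, k, n):
    """Projected run time at problem size n."""
    return c * n ** k

# Synthetic "measurements" (NOT real data): exactly 2e-9 * n^3 seconds.
sizes = [1000, 2000, 4000, 8000]
times = [2e-9 * n ** 3 for n in sizes]

c, k = fit_power_law(sizes, times)
print(round(k, 6), predict(c, k, 16000))
```

Real signatures would combine several such fits (computation, memory locality, message volume) and would need care when extrapolating beyond the measured range.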
PSEs for the Grid
• Strategy: Program the Grid in a scripting language
  —Matlab or some other high-level language
  —Components professionally developed for Grid usage
    – Include mappers and performance estimators
• Compiler tasks:
  —Build mapper and performance estimator for whole application from those for individual components
  —Take advantage of opportunities to optimize sequences of component invocations into a single one
    – For example, exploit opportunities for pipelining
• Execution system task:
  —Find opportunities to run components on machines where they have been preinstalled and optimized
Testbeds
• Goal:
  —Provide vehicle for experimentation with the dynamic components of the GrADS software framework
• MacroGrid (Carl Kesselman)
  —Collection of processors running Globus and GrADS framework
    – Consistent software environment
  —At all 9 GrADS sites (but 3 are really useful)
    – Availability listed on web page
  —Permits experimentation with real applications
• MicroGrid (Andrew Chien)
  —Cluster of processors (currently Compaq Alphas and x86 clusters)
  —Runs standard Grid software (Globus, Nexus, GrADS middleware)
  —Permits simulation of varying loads and configurations
    – Stress GrADS components (performance modeling and control)
Research Strategy
• Application Studies
  —Prototype a series of applications using components of envisioned execution system
    – ScaLAPACK and Cactus demonstration projects
• Move from Hand Development to Automated System
  —Identify key components that can be isolated and built into a Grid execution system
    – e.g., prototype reconfiguration system
  —Use experience to elaborate design of software support systems
• Experiment
  —Use testbeds to evaluate results and refine design
Progress Report
• Testbeds Working
• Preliminary Application Studies Complete
  —ScaLAPACK and Cactus
  —GrADS functionality built in
ScaLAPACK Across 3 Clusters
[Figure: execution time in seconds (0 to 3500) versus matrix size (0 to 20000) on the OPUS, TORC, and CYPHER clusters. Data points are annotated with the processors chosen: 5 OPUS; 8 OPUS; 8 OPUS, 6 CYPHER; 8 OPUS, 2 TORC, 6 CYPHER; 6 OPUS, 5 CYPHER; 2 OPUS, 4 TORC, 6 CYPHER; 8 OPUS, 4 TORC, 4 CYPHER.]
Largest Problem Solved
• Matrix of size 30,000
  —7.2 GB for the data
  —32 processors to choose from at UIUC and UT
    – Not all machines have 512 MB; some have as little as 128 MB
  —PM chose 17 processors in 2 clusters from UT
  —Computation took 84 minutes
    – 3.6 Gflop/s total
    – 210 Mflop/s per processor
    – ScaLAPACK on a cluster of 17 processors would get about 50% of peak
    – Processors are 500 MHz or 500 Mflop/s peak
    – For this Grid computation, 20% less than ScaLAPACK
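The figures on this slide can be cross-checked with a few lines of arithmetic. The check below assumes 8-byte double-precision entries and an LU factorization costing about 2n^3/3 floating-point operations; the slide does not state the operation count it used, so that is an assumption, although it reproduces the quoted rates.

```python
n = 30_000
bytes_per_entry = 8                      # assumption: double precision
data_gb = n * n * bytes_per_entry / 1e9  # matrix storage in GB

flops = 2 * n ** 3 / 3                   # assumption: LU operation count
seconds = 84 * 60                        # reported run time

total_gflops = flops / seconds / 1e9
per_proc_mflops = flops / seconds / 17 / 1e6

peak_mflops = 500.0                      # per processor, from the slide
single_cluster_mflops = 0.5 * peak_mflops  # "about 50% of peak"
shortfall = 1 - per_proc_mflops / single_cluster_mflops

print(f"{data_gb:.1f} GB, {total_gflops:.2f} Gflop/s total, "
      f"{per_proc_mflops:.0f} Mflop/s per processor, "
      f"{shortfall:.0%} below the single-cluster estimate")
```

Under these assumptions the storage and rate figures match the slide (7.2 GB, about 3.6 Gflop/s, about 210 Mflop/s per processor), and the gap to a single-cluster ScaLAPACK run comes out near 16%, in the same range as the slide's rounded 20%.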
PDSYEVX – Timing Breakdown
[Figure: chart titled "PDSYEVX - torcs, cyphers". Time in seconds (0 to 6000) versus N-nproc (1000-1, 2000-1, 3000-1, 4000-2, 5000-4, 7000-5, 10000-10). Components: tridiagonal reduction, compute eigenvalues, compute eigenvectors, back transformation, pdsyevx_driver_overhead, mds, nws, perf_modeling_time, other_grid_overhead. Annotations: "Uses torcs only"; "Uses 5 torcs and 5 cyphers".]
Cactus
[Figure: SDSC IBM SP (1024 procs, 5x12x17 = 1020) and NCSA Origin Array (256+128+128 procs, 5x12x(4+2+2) = 480). Link labels: Gig-E, 100 MB/sec; OC-12 line (but only 2.5 MB/sec).]
• Solved equations for gravitational waves (real code)
  —Tightly coupled, communications required through derivatives
  —Must communicate 30 MB/step between machines
  —Time step takes 1.6 sec
• Used 10 ghost zones along direction of machines: communicate every 10 steps
• Compression/decompression on all data passed in this direction
• Achieved 70-80% scaling, ~200GF (only 14% scaling without tricks)
• Gordon Bell Award Winner at SC’2001
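The ghost-zone trick can be demonstrated on a much smaller problem. The sketch below is not Cactus code: it uses a 1D explicit diffusion stencil split across two subdomains, and all names and sizes are made up. Each subdomain carries `ghost` extra cells copied from its neighbor; every local step makes one more ghost cell stale, so with 10 ghost cells the owned cells stay correct for 10 steps and the neighbors need to exchange data only every 10th step.

```python
def step(u, alpha=0.25):
    """One explicit diffusion step; both end cells are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

def run_single(u, steps):
    """Reference: the whole domain on one machine."""
    for _ in range(steps):
        u = step(u)
    return u

def run_split(u, steps, ghost):
    """Two subdomains, each padded with `ghost` cells from its neighbor.
    Ghost cells are refreshed only every `ghost` steps."""
    mid = len(u) // 2
    left = u[:mid + ghost]      # owns [0, mid), ghosts [mid, mid + ghost)
    right = u[mid - ghost:]     # ghosts [mid - ghost, mid), owns [mid, N)
    exchanges = 0
    for s in range(steps):
        if s > 0 and s % ghost == 0:
            from_right = right[ghost:2 * ghost]  # right's first owned cells
            from_left = left[mid - ghost:mid]    # left's last owned cells
            left[mid:] = from_right
            right[:ghost] = from_left
            exchanges += 1
        left = step(left)
        right = step(right)
    return left[:mid] + right[ghost:], exchanges

initial = [1.0 if 24 <= i < 40 else 0.0 for i in range(64)]
reference = run_single(initial, steps=40)
result, exchanges = run_split(initial, steps=40, ghost=10)
max_error = max(abs(a - b) for a, b in zip(result, reference))
print(exchanges, max_error)
```

The split run reproduces the single-domain result with 3 exchanges instead of the 39 that a one-cell ghost zone would need. Each exchange moves proportionally more data and the ghost cells cost redundant computation, so the gain is in the number of messages over a high-latency link, which is presumably why the slide pairs this technique with compression. For scale, the slide's own numbers imply that 30 MB per step at 2.5 MB/sec would take 12 seconds of transfer against a 1.6-second time step.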
Progress Report
• Testbeds Working
• Application Studies Complete
  —ScaLAPACK and Cactus
  —GrADS functionality built in
• Prototype Execution System Complete
  —All components of Execution System (except rescheduling/migration)
  —Six applications working in new framework
  —Demonstrations at SC02
    – ScaLAPACK, FASTA, Cactus, GrADSAT
    – In NPACI, NCSA, Argonne, Tennessee, Rice Booths
  —Currently adding in rescheduling capability
• Prototype Program Preparation Tools Under Development
  —Black-box performance model construction preliminary experiments
  —Prototype mapper generator complete
    – Generated Grid version of HPF application Tomcatv
SC02 Demo Applications
• ScaLAPACK
  —LU decomposition of large matrices
• Cactus
  —Solver for gravitational wave equations
  —Collaboration with Ed Seidel’s GridLAB
• FASTA
  —Biological sequence matching on distributed databases
• Smith-Waterman
  —Another sequence matching application using a strong algorithm
• Tomcatv
  —Vectorized mesh generation written in HPF
• Satisfiability
  —An NP-complete problem useful in circuit design and verification
Summary
• Goal:
  —Design and build programming systems for the Grid that broaden the community of users who can develop and run applications in this complex environment
• Strategy:
  —Build an execution environment that automates the most difficult tasks
    – Maps applications to available resources
    – Manages adapting to varying loads and changing resources
  —Automate the process of producing Grid-ready programs
    – Generate performance models and mapping strategies semi-automatically
  —Construct programs using high-level domain-specific programming interfaces
Resources
• GrADS Web Site
  —http://hipersoft.rice.edu/grads/
  —Contains:
    – Planning reports
    – Technical reports
    – Links to papers
High Level Programming
• Rationale
  —programming is hard, and getting harder with new platforms
  —professional programmers are in short supply
  —high performance will continue to be important
• Strategy: Make the End User a Programmer
  —professional programmers develop components
  —users integrate components using:
    – problem-solving environments (PSEs) based on scripting languages (possibly graphical)
    – examples: Visual Basic, Tcl/Tk, AVS, Khoros
• Achieving High Performance
  —translate scripts and components to common intermediate language
  —optimize the resulting program using whole-program compilation
Telescoping Languages
[Figure: telescoping languages diagram. Labels: L1 Class Library; Compiler Generator (could run for hours); L1 Compiler (understands library calls as primitives); Script; Script Translator; Vendor Compiler; Optimized Application.]
Applications
• Matlab Compiler
  —Automatically generated from LAPACK or ScaLAPACK
• Matlab SP*
  —Based on signal processing library
• Optimizing Component Integration System
  —DOE Common Component Architecture
  —High component invocation costs
• Generator for ARPACK
  —Avoid recoding developer version by hand
• System for Analysis of Cancer Experiments*
  —Based on S+ (collaboration with M.D. Anderson Cancer Center)
• Flexible Data Distributions in HPF
  —Data distribution == collection of interfaces that meet specs
• Generator for Grid Computations*
  —GrADS: automatic generation of NetSolve