The Cactus Code: A Parallel, Collaborative Framework for Large Scale Computing
Gabrielle Allen
Max Planck Institute for Gravitational Physics,
(Albert Einstein Institute)
Outline
THE GRID: dependable, consistent, pervasive access to high-end resources.
CACTUS is a freely available, modular, portable and manageable environment for collaboratively developing parallel, high-performance, multi-dimensional simulations.
History
Cactus originated in 1997 as a code for numerical relativity, following a long line of codes developed in Ed Seidel's research groups, at NCSA and more recently the AEI.
Numerical relativity: complicated 3D hyperbolic/elliptic PDEs, dozens of equations, thousands of terms, many people from very different disciplines working together, needing a fast, portable, flexible, easy-to-use code which can incorporate new technologies without disrupting users.
Original authors: Paul Walker, Joan Masso, John Shalf, Ed Seidel.
Cactus 4.0, August 1999: total rewrite and redesign of the code, learning from experiences with previous versions.
[Figure: simulation snapshots at t=0 and t=100]
Need multi-Tflop, Tbyte computing!!
Gravitational Wave Astronomy: New Field, Fundamental New Information about the Universe
Multi-Teraflop Computation, AMR, Elliptic-Hyperbolic Numerical Relativity
Numerical Relativity With Cactus
Biggest computations ever: 256-processor O2K (SGI Origin 2000) at NCSA, 225,000 SUs, 1 Tbyte of output data in a few weeks.
Black holes (prime source for GW): increasingly complex collisions; now doing full 3D grazing collisions.
Gravitational waves: study linear waves as testbeds, then move on to fully nonlinear waves. Interesting physics: BH formation in full 3D!
Neutron stars: developing the capability to do full GR hydro; can now follow full orbits!
What is Cactus
Flesh (ANSI C) provides the code infrastructure: parameter, variable and scheduling databases, error handling, APIs, make system, parameter parsing, ...
Thorns (F77/F90/C/C++/Java/Perl/Python) are plug-in, swappable modules or collections of subroutines providing both the computational infrastructure and the physical application. Well-defined interface through 3 configuration files.
Just about anything can be implemented as a thorn: driver layer (MPI, PVM, SHMEM, ...), black hole evolvers, elliptic solvers, reduction operators, interpolators, web servers, grid tools, IO, ...
User driven: easy parallelism, no new paradigms, flexible.
Collaborative: thorns borrow concepts from OOP; thorns can be shared; lots of collaborative tools.
Computational Toolkit: existing thorns for (parallel) IO, elliptic solvers, MPI unigrid driver.
Integrates other common packages and tools: HDF5, Globus, PETSc, PAPI, Panda, FlexIO, GrACE, Autopilot, LCAVision, OpenDX, Amira, ...
Trivially Grid enabled!
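To make the thorn concept concrete, here is a minimal sketch of a C evolution routine, modeled on the WaveToyC example thorn. The grid function names (phi, phi_p, phi_p_p), the simplified loop bounds and the uniform grid spacing are assumptions for illustration; the three configuration files (interface.ccl, param.ccl, schedule.ccl) declare the variables, parameters and schedule.

    #include "cctk.h"
    #include "cctk_Arguments.h"
    #include "cctk_Parameters.h"

    /* Scheduled by the Flesh at each evolution step (via schedule.ccl).
       Grid variables and parameters arrive through CCTK_ARGUMENTS. */
    void WaveToyC_Evolution(CCTK_ARGUMENTS)
    {
      DECLARE_CCTK_ARGUMENTS;   /* grid functions, cctk_lsh, ... */
      DECLARE_CCTK_PARAMETERS;  /* parameters from param.ccl */

      const CCTK_REAL dx     = CCTK_DELTA_SPACE(0);  /* assume dx=dy=dz */
      const CCTK_REAL dt     = CCTK_DELTA_TIME;
      const CCTK_REAL factor = dt*dt/(dx*dx);
      int i, j, k;

      /* Leapfrog update of the scalar wave equation on the
         processor-local grid; the driver thorn (e.g. PUGH) handles
         domain decomposition and ghost-zone exchange. */
      for (k = 1; k < cctk_lsh[2]-1; k++)
        for (j = 1; j < cctk_lsh[1]-1; j++)
          for (i = 1; i < cctk_lsh[0]-1; i++)
          {
            const int idx = CCTK_GFINDEX3D(cctkGH, i, j, k);
            phi[idx] = 2.0*phi_p[idx] - phi_p_p[idx]
              + factor*( phi_p[CCTK_GFINDEX3D(cctkGH,i+1,j,k)]
                       + phi_p[CCTK_GFINDEX3D(cctkGH,i-1,j,k)]
                       + phi_p[CCTK_GFINDEX3D(cctkGH,i,j+1,k)]
                       + phi_p[CCTK_GFINDEX3D(cctkGH,i,j-1,k)]
                       + phi_p[CCTK_GFINDEX3D(cctkGH,i,j,k+1)]
                       + phi_p[CCTK_GFINDEX3D(cctkGH,i,j,k-1)]
                       - 6.0*phi_p[idx] );
          }
    }

The same routine could equally be written in F90 or C++; the Flesh generates the argument lists and calling sequences from the configuration files.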
Current Version: Cactus 4.0
Cactus 4.0 beta 1 released September 1999.
Community code: distributed under the GNU GPL.
Currently: Cactus 4.0 beta 8.
Supported architectures: SGI Origin, SGI 32/64, Cray T3E, Dec Alpha, Intel Linux IA32/IA64, Windows NT, HP Exemplar, IBM SP2, Sun Solaris, Hitachi SR8000-F, NEC SX-5, Mac Linux, ...
[Figure: Cactus scaling vs. number of processors on the Origin and NT SC]
[Figure: Cactus scaling on the T3E-600, total Mflop/s vs. number of processors (1 to 1000, log-log), with data points at 192, 760, 5980 and 47900 Mflop/s]
Cactus Computational Toolkit: parallel utilities (thorns) for computational scientists
CactusBase: Boundary, IOUtil, IOBasic, CartGrid3D, IOASCII, Time
CactusBench: BenchADM
CactusConnect: HTTPD, HTTPDExtra
CactusExample: WaveToy1DF77, WaveToy2DF77
CactusElliptic: EllBase, EllPETSc, EllSOR, EllTest
CactusPUGH: Interp, PUGH, PUGHSlab, PUGHReduce
CactusPUGHIO: IOFlexIO, IOHDF5, IsoSurfacer
CactusTest: TestArrays, TestCoordinates, TestInclude1, TestInclude2, TestComplex, TestInterp, TestReduce
CactusWave: IDScalarWave, IDScalarWaveC, IDScalarWaveCXX, WaveBinarySource, WaveToyC, WaveToyCXX, WaveToyF77, WaveToyF90, WaveToyFreeF90
external: IEEEIO, RemoteIO, TCPXX, jpeg6b
BetaThorns (in development): IOStreamedHDF5, IOJpeg, IOHDF5Util, ..., many more
How To Use Cactus
The application scientist usually concentrates on the application: physics, performance, algorithms. Logically: operations on a grid (structured or unstructured). Program in any language.
Then take advantage of parallel API features enabled by Cactus: IO, data streaming, remote visualization/steering, AMR, MPI/PVM, checkpointing, Grid computing, interpolations, reductions, etc. (see the reduction sketch below).
Abstraction allows one to switch between different MPI or PVM layers, different I/O layers, etc., with no or minimal changes to the application!
(Nearly) all architectures supported and autoconfigured. Common to develop on a laptop (with or without MPI); run on anything.
Metacode concept: very, very lightweight, not a huge framework. The user specifies the desired code modules in configuration files; the desired code is generated, with automatic routine calling sequences, syntax checking, etc. You can actually read the code it creates...
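For instance, a thorn can ask the Flesh for a global reduction without knowing whether MPI, PVM or a single-processor driver carries it out underneath. A sketch, assuming the WaveToy grid function wavetoy::phi (the function and thorn names here are illustrative):

    #include "cctk.h"

    /* Global maximum of a grid function: the registered reduction
       operator (e.g. from thorn PUGHReduce) does the communication. */
    void PrintMaxPhi(cGH *cctkGH)
    {
      CCTK_REAL max_phi;
      const int handle = CCTK_ReductionHandle("maximum");
      const int vindex = CCTK_VarIndex("wavetoy::phi");

      if (handle >= 0 && vindex >= 0)
      {
        /* proc = -1: deliver the result on every processor */
        CCTK_Reduce(cctkGH, -1, handle, 1, CCTK_VARIABLE_REAL,
                    &max_phi, 1, vindex);
        CCTK_VInfo("WaveToy", "max(phi) = %g", (double)max_phi);
      }
    }

Swapping the driver (PUGH, a Grid-enabled MPI, or SingleProc) changes nothing in this code; only the configuration differs.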
Cactus Community
[Diagram: groups using and collaborating with Cactus] AEI Cactus Group (Allen), Numerical Relativity Community, DLR, Geophysics (Bosl), Cornell, Crack propagation, NASA NS GC, Livermore, SDSS (Szalay), Intel, Microsoft, Clemson, "Egrid" (NCSA, ANL, SDSC), NSF KDI (Suen), EU Network (Seidel), Astrophysics (Zeus), US Grid Forum, DFN Gigabit (Seidel), "GRADS" (Kennedy, Foster, Dongarra, et al.), ChemEng (Bishop), San Diego, GMD, Cornell, Berkeley
Grid Computing
The AEI Numerical Relativity Group has access to high-end resources in over ten centers in Europe/USA.
They want: bigger simulations, more simulations and faster throughput; intuitive IO at the local workstation; no new systems/techniques to master!!
How to make the best use of these resources?
Provide easier access ... no one can remember ten usernames, passwords, batch systems, file systems, ... a great start!!!
Combine resources for larger production runs (more resolution badly needed!).
Dynamic scenarios ... automatically use what is available.
Many other reasons for Grid computing for computer scientists, funding agencies, supercomputer centers ...
Grid-Enabled Cactus
Cactus and its ancestor codes have been using Grid infrastructure since 1993.
Support for Grid computing was part of the design requirements for Cactus 4.0 (experiences with Cactus 3).
Cactus compiles "out-of-the-box" with Globus [using the globus device of MPICH-G(2)].
The design of Cactus means that applications are unaware of the underlying machine(s) that the simulation is running on ... applications become trivially Grid-enabled.
Infrastructure thorns (I/O, driver layers) can be enhanced to make the most effective use of the underlying Grid architecture.
Cactus + Globus
[Diagram: layered architecture]
Cactus Application Thorns (distribution information hidden from the programmer): initial data, evolution, analysis, etc.
Grid-Aware Infrastructure Thorns (drivers for parallelism, IO, communication, data mapping): PUGH, parallelism via MPI (MPICH-G2, grid-enabled message-passing library).
Grid-Enabled Communication Library: MPICH-G2 implementation of MPI, which can run MPI programs across heterogeneous computing resources; alternatives are standard MPI or SingleProc.
Grid Experiments
SC93: remote CM-5 simulation with live viz in a CAVE.
SC95: heroic I-Way experiments led to the development of Globus. Cornell SP-2, Power Challenge, with live viz in the San Diego CAVE.
SC97: Garching 512-node T3E, launched, controlled and visualized in San Jose.
SC98: HPC Challenge. SDSC, ZIB and Garching T3Es compute the collision of 2 neutron stars, controlled from Orlando.
SC99: colliding black holes using the Garching and ZIB T3Es, with remote collaborative interaction and viz at the ANL and NCSA booths.
2000: single simulation across LANL, NCSA, NERSC, SDSC, ZIB, Garching, ... Dynamic distributed computing ... spawning new simulations!!
Grand Picture
[Diagram: simulations launched from the Cactus Portal run Grid-enabled Cactus on distributed machines (Origin at NCSA, T3E at Garching); remote steering and monitoring from an airport; remote viz and steering from Berlin; remote viz in St Louis; viz of data from previous simulations in an SF café; connected via Globus, http, HDF5 and IsoSurfaces, with DataGrid/DPSS and downsampling.]
Demo: Remote Computing
Have most of this working now. Need to make it commonplace, and trivially available to users. Requires development of readers/networks for viz clients too.
Remote simulation:
– Monitor and steer using thorn HTTPD
– Display live isosurfaces with thorn IsoSurfacer and the IsoView GUI
– Display full live viz with the HDF5 thorns and OpenDX
Remote Visualization
[Diagram: isosurfaces and geodesics streamed to Amira; contour plots (download) to Amira; grid functions streamed via HDF5 to Amira, LCA Vision and OpenDX.]
Remote Visualization
Streaming data from the Cactus simulation to a viz client. Clients: OpenDX, Amira, LCA Vision, ...
Protocols:
Proprietary:
– isosurfaces, geodesics
HTTP:
– parameters, xgraph data, JPEGs
Streaming HDF5:
– HDF5 provides downsampling and hyperslabbing
– all of the above data, and all possible HDF5 data (e.g. 2D/3D)
– two different technologies: a streaming Virtual File Driver (I/O rerouted over a network stream) and an XML wrapper (HDF5 calls wrapped and translated into XML)
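The downsampling and hyperslabbing map directly onto HDF5's strided hyperslab selection. A client-side sketch of reading every 4th point of a 3D grid function (the file name, dataset name and grid sizes are assumptions):

    #include <hdf5.h>

    /* Read a 32^3 downsampled view of a 128^3 dataset (stride 4). */
    void read_downsampled(void)
    {
      hsize_t start[3]  = {0, 0, 0};
      hsize_t stride[3] = {4, 4, 4};
      hsize_t count[3]  = {32, 32, 32};
      static double data[32][32][32];

      hid_t file  = H5Fopen("phi.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
      hid_t dset  = H5Dopen(file, "phi timelevel 0");  /* assumed name */
      hid_t space = H5Dget_space(dset);
      hid_t mem   = H5Screate_simple(3, count, NULL);

      /* Select every 4th point in each dimension, then read. */
      H5Sselect_hyperslab(space, H5S_SELECT_SET, start, stride, count, NULL);
      H5Dread(dset, H5T_NATIVE_DOUBLE, mem, space, H5P_DEFAULT, data);

      H5Sclose(mem); H5Sclose(space); H5Dclose(dset); H5Fclose(file);
    }

With the streaming Virtual File Driver, the same read calls pull data from a network stream instead of a local file.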
Remote Visualization (2)
Clients:
Proprietary:
– Amira
HTTP:
– any browser (+ xgraph helper application)
HDF5:
– any HDF5-aware application: h5dump, Amira, OpenDX, LCA Vision (soon)
XML:
– any XML-aware application: Perl/Tk GUI, future browsers (need XSL stylesheets)
Remote Visualization - Issues
Parallel streaming: Cactus can do this, but readers are not yet available on the client side.
Handling of port numbers: clients currently have no method for finding the port number that Cactus is using for streaming; development of an external meta-data server is needed (ASC/TIKSL).
Generic protocols.
Data server: Cactus should pass data to a separate server that would handle multiple clients without interfering with the simulation; TIKSL provides middleware (streaming HDF5) to implement this.
Output parameters for each client.
Remote Steering
[Diagram: remote viz data streamed from the simulation via XML, HTTP and HDF5 to Amira or any viz client, with changed parameters streamed back.]
Remote Steering
Stream parameters from the Cactus simulation to a remote client, which changes parameters (GUI, command line, viz tool) and streams them back to Cactus, where they change the state of the simulation.
Cactus has a special STEERABLE tag for parameters, indicating that it makes sense to change them during a simulation and that there is support for them to be changed.
Examples: IO parameters, frequency, fields, timestep, debugging flags.
Current protocols: XML (HDF5) to a standalone GUI; HDF5 to viz tools (Amira); HTTP to web browsers (HTML forms).
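Under the hood, a steering interface changes a STEERABLE parameter through the Flesh's parameter API. A minimal sketch of what such a handler might do (the parameter and thorn names are illustrative):

    #include "cctk.h"
    #include "cctk_Parameter.h"

    /* Called by a steering interface (e.g. thorn HTTPD) when a remote
       client submits a new value.  Only parameters tagged STEERABLE in
       param.ccl may be changed while the simulation runs. */
    int SteerParameter(const char *new_value)
    {
      /* Illustrative names: set IOUtil's output frequency. */
      int status = CCTK_ParameterSet("out_every", "IOUtil", new_value);

      if (status < 0)
      {
        CCTK_WARN(1, "Parameter is not steerable, or the value is invalid");
      }
      return status;
    }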
Thorn HTTPD
Thorn which allows any simulation to act as its own web server.
Connect to the simulation from any browser anywhere.
Monitor the run: parameters, basic visualization, ...
Change steerable parameters.
See a running example at www.CactusCode.org.
Wireless remote viz, monitoring and steering.
Remote Steering - Issues
Same kinds of problems as remote visualization: generic protocols, handling of port numbers, broadcasting of active Cactus simulations.
Security: logins; who can change parameters?
Lots of issues still to resolve ...
Remote Offline Visualization
[Diagram: a visualization client (Amira) in Berlin accesses 4 TB of data at NCSA through the HDF5 VFD and DataGrid (Globus), via DPSS, FTP and Web servers behind a remote data server; downsampling and hyperslabs mean only what is needed is transferred.]
Remote Offline Visualization
Accessing remote data for local visualization. Should allow downsampling, hyperslabbing, etc.
Access via DPSS is working (TIKSL). Waiting for DataGrid support for HTTP and FTP to remove the dependency on the DPSS file systems.
New Grid Applications
Dynamic staging: move to a faster/cheaper/bigger machine ("Cactus Worm").
Multiple universe: create a clone to investigate a steered parameter ("Cactus Virus").
Automatic convergence testing: from initial data, or initiated during the simulation.
Look ahead: spawn off and run a coarser resolution to predict the likely future.
Spawn independent/asynchronous tasks: send them to a cheaper machine while the main simulation carries on.
Thorn profiling: best machine/queue; choose resolution parameters based on the queue ...
New Grid Applications (2)
Dynamic load balancing: inhomogeneous loads, multiple grids.
Portal: resource choosing, simulation launching, management.
Intelligent parameter surveys: farm out to different machines.
Make use of: running with management tools such as Condor, Entropia, etc.; scripting thorns (management, launching new jobs, etc.); dynamic use of e.g. MDS for finding available resources.
Dynamic Grid Computing
[Diagram: a simulation hops among NCSA, SDSC, RZG and LRZ as conditions change. Labels: find best resources; free CPUs!!; queue time over, find new machine; add more resources; clone job with steered parameter; look for horizon; found a horizon, try out excision; calculate/output grav. waves; calculate/output invariants; archive data.]
Users View
Cactus Worm
Egrid testbed: 10 sites. A simulation starts on one machine, seeks out new resources (faster/cheaper/bigger) and migrates there, etc., etc.
Uses: Cactus, Globus. Protocols: gsissh, gsiftp. Streams or copies data. Queries the Egrid GIIS at each site. Publishes simulation information to the Egrid GIIS.
Demonstrated at Dallas SC2000. Development proceeding with KDI ASC (USA), TIKSL/GriKSL (Germany), GrADS (USA), and the Application Group of Egrid (Europe).
A fundamental dynamic Grid application!!! Leads directly to many more applications.
Demo: Cactus Worm
Worm running around 10 sites of the Egrid testbed. Currently developing more features, fault tolerance and logging. Will run for around 1000 generations (1 day), then dies!
Dynamic Grid Computing
Fundamental issues (all needed for the Cactus Worm):
Dynamic resource selection (query an information server)
Authentication (how to move files, issue remote shell commands)
Executable staging (build on demand, or maintain a database?)
Data migration (copy, stream, which protocol?)
Fault tolerance (essential!!!!)
Book-keeping (essential!!!! ... where did the output go, what actually happened?)
Publishing of simulation information (information should be available to you and your collaborators)
User Portal
Find resources: automatically find machines where the user has an allocation (group aware!); continuously monitor resources, network, etc.
Authentication: single login, no need to remember lots of usernames/passwords.
Launch simulation: automatically create the executable on the chosen machine; write data to an appropriate storage location; negotiate local queue structures.
Monitor/steer simulations: access remote visualization and steering while the simulation is running; collaborative ... choose who else can look in and/or steer; performance ... how efficient is the simulation?
Archiving: store thorn lists, parameter files, output locations, configurations, ...
Cactus Portal
KDI ASC project. Technology: Globus, GSI, Java Beans, DHTML, Java CoG, MyProxy, GPDK, TomCat, Stronghold.
Allows submission of distributed runs.
Accesses the ASC Grid Testbed (SDSC, NCSA, Argonne, ZIB, LRZ, AEI).
Undergoing testing by users now! The main difficulty now is that it requires everything to work ... robustness!! But it is going to revolutionise our use of computing resources.
Grid Related Projects
ASC: Astrophysics Simulation Collaboratory. NSF funded (WashU, Rutgers, Argonne, U. Chicago, NCSA). Collaboratory tools, Cactus Portal. Starting to use the Portal for production runs.
E-Grid: European Grid Forum (GGF: Global Grid Forum). Working Group for Testbeds and Applications (chair: Ed Seidel). Test application: Cactus+Globus. Demos at Dallas SC2000.
GrADS: Grid Application Development Software. NSF funded (Rice, NCSA, U. Illinois, UCSD, U. Chicago, U. Indiana, ...). Application driver for grid software.
Grid Related Projects (2)
Distributed Runs: AEI, Argonne, U. Chicago. Working towards running on several computers, 1000s of processors (different processors, memories, OSs, resource management, varied networks, bandwidths and latencies).
TIKSL/GriKSL: German DFN funded (AEI, ZIB, Garching). Remote online and offline visualization, remote steering/monitoring.
Cactus Team: dynamic distributed computing ... testing of alternative communication protocols (MPI, PVM, SHMEM, pthreads, OpenMP, Corba, RDMA, ...) ... developing the Grid Application Development Toolkit.
Grid Application Development Toolkit
The application developer should be able to build simulations with tools that easily enable dynamic grid capabilities.
Want to build a programming API to easily allow (a sketch follows after the list below):
Query information server (e.g. GIIS)
– What's available for me? What software? How many processors?
Network monitoring
Decision thorns
– How to decide? Cost? Reliability? Size?
Spawning thorns
– Now start this up over here, and that up over there
Authentication server
– Issues commands, moves files on your behalf (can't pass on a Globus proxy)
Grid Application Development Toolkit (2)
Information server
– What is running where? Where to connect for viz/steering? What and where are other people in the group running?
– Spawn hierarchies
– Distribute/load-balance
Data transfer
– Use whatever method is desired: Gsi-ssh, Gsi-ftp, streamed HDF5, scp, GASS, etc.
LDAP routines for simulation codes
– Write simulation information in LDAP format
– Publish to an LDAP server
Stage executables
– CVS checkout of new codes that become connected, etc.
Etc. ...
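Since this toolkit is still being designed, the API below is purely hypothetical: a sketch of what the calls discussed above might look like, with every name invented for illustration.

    /* Hypothetical Grid Application Development Toolkit API (sketch
       only; none of these functions exist yet). */

    typedef struct GridResource GridResource;  /* opaque resource handle */

    /* Query an information server (e.g. a GIIS): what is available for
       me, what software, how many processors? */
    int GADT_QueryResources(const char *giis_url, const char *requirements,
                            GridResource **matches, int *num_matches);

    /* Decide among candidates: cost, reliability, size, network state. */
    GridResource *GADT_ChooseResource(GridResource *matches, int num_matches,
                                      const char *policy);

    /* Stage an executable: copy it, or CVS checkout and build on demand. */
    int GADT_StageExecutable(GridResource *target, const char *thornlist);

    /* Start this up over here, that up over there; data migrates by
       gsiftp, streamed HDF5, scp, GASS, ... whichever is requested. */
    int GADT_SpawnSimulation(GridResource *target, const char *parfile);

    /* Publish simulation information (e.g. in LDAP format) so that
       collaborators can find, monitor and steer the run. */
    int GADT_PublishSimulationInfo(const char *ldap_server,
                                   const char *info_ldif);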
If we build this, we can get developers and users!
More Information ...
Cactus: www.CactusCode.org (documentation, tutorials, etc.)
Cactus Worm: www.CactusCode.org/Development/Egrid.html
Global Grid Forum (Egrid): www.egrid.org, www.gridforum.org
ASC Portal: www.ascportal.org
TIKSL Gigabit Computing: www.zib.de/Visual/projects/TIKSL/
Black holes and neutron stars, pictures and movies: jean-luc.aei.mpg.de
Any questions: [email protected]