Metacomputing: What’s in it for ME?

Legion 1.3

Greg Lindahl, Andrew Grimshaw, Adam Ferrari, Katherine Holcomb

This work partially supported by DARPA(Navy) contract # N66001-96-C-8527, DOE grant DE-FD02-96ER25290, DOE contract Sandia LD-9391, Northrop Grumman (for the DoD HPCMOD/PET program), DOE D459000-16-3C and DARPA (GA) SC H607305A

What is a Metasystem, anyway?

Geographically separated collection of people, computers, storage

A fast network to connect them all

Software which makes this mess easy to use

It should be as easy to use as the single machine on your desktop

It should be easy to collaborate with other people all over the world

Fewer meteorological centers are islands today

So what’s wrong with existing systems?

Different sites have different software environments, even for identical OSes

Different sites don’t share filesystems, wasting your time ftping files around

Security policies often make using multiple sites much less convenient

But the fundamental problem is...

Stretching the old model (interacting but autonomous computers) to larger and larger systems results in incomplete and incompatible solutions

These solutions don’t scale for the future, nor work together today.

Our Vision - One Transparent System

High-performance, distributed, secure, fault-tolerant, transparent

Metacomputing Benefits

More effective collaboration, by sharing a workspace (video conferencing is not enough)

Higher performance, from use of off-site resources and easier construction of parallel and coupled applications

Improved productivity from a simpler environment (the holy grail that no one has delivered)

Example Application

Multi-scale climate modeling

El Nino: global O/AGCM coupled with regional weather model

Features: complicated information exchange; components written by different groups in different languages; components run best on very different hardware (T3E, T90)

UCLA model uses Cray pointers, runs well on T3E. Regional model isn’t parallel.

Issues in our example

Vendor software such as MPI doesn’t interoperate

Hard to start executing in 2 places at once (queues)

Fault-tolerance of whole system (more pieces to break)

Cross-site security

Coupling to visualization tools

What Legion Provides

Legion philosophy (CS slide)

Provide flexible mechanisms, not fixed policies

Allow users/application designers to choose a point in this space, based on their own requirements.

[Figure: design space with three axes: level of service, kind of service, cost]

Legion’s Concrete Benefits

Transparent, remote access to files

Transparent, remote execution

Wide-area parallel processing: bag of tasks, large parallel apps

Meta applications

Still sounds like a bunch of jargon, doesn’t it?

Remote file access today

NFS. Requires super-user configuration, has awful security properties. Only one kind of synchronization, and it’s different from local files

SMBFS: Ditto. OK, so users can share files, but it’s a security hole.

Web: read-only, unless you use really bad security methods

File Access Tomorrow

Legion allows users to share persistent objects, securely, anywhere

Files just happen to have a particular interface

All properties set on a per-file basis: security, fault-tolerance, caching, special interfaces

Application-specific interfaces

Today: only sequential and one type of array

Legion supports anything you can think of, including file-like objects which are actually whole simulations

Remote Execution Today

Pick which resource to use (by hand): check load averages, think about problem size, guess turn-around time

Copy your files around (program, data): a big pain if you don’t share filesystems

Deal with the queuing system

Complicated enough that most people just stick with 1 resource, even if it isn’t right for much of their work

Remote Execution Tomorrow

Run the program “at your workstation”

OK, OK, I really meant...

The system should help automate picking the right resource, to help you optimize your turn-around time

The system should get your files there even if it’s at a remote site

The system should provide uniform access to queuing systems
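
As a rough illustration of what "picking the right resource" could mean, the sketch below estimates turn-around time as queue wait plus file transfer plus run time, and picks the minimum. This is not Legion's scheduler; the Resource fields, the machine names, and every number are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical description of one machine (not a Legion data structure)."""
    name: str
    queue_wait_s: float    # expected time waiting in the queue
    bandwidth_mb_s: float  # bandwidth from where the input files live
    speed: float           # relative speed (1.0 = reference machine)


def turnaround_s(r: Resource, data_mb: float, ref_runtime_s: float) -> float:
    """Estimated wall-clock time from submission to finished job."""
    transfer = data_mb / r.bandwidth_mb_s
    run = ref_runtime_s / r.speed
    return r.queue_wait_s + transfer + run


def pick_resource(resources, data_mb: float, ref_runtime_s: float) -> Resource:
    """Choose the resource with the smallest estimated turn-around time."""
    return min(resources, key=lambda r: turnaround_s(r, data_mb, ref_runtime_s))


# Assumed numbers: a fast remote machine with a long queue and a slow link,
# versus the idle workstation on your desk.
resources = [
    Resource("local_workstation", queue_wait_s=0, bandwidth_mb_s=100.0, speed=1.0),
    Resource("remote_mpp", queue_wait_s=1800, bandwidth_mb_s=1.0, speed=20.0),
]

small = pick_resource(resources, data_mb=10, ref_runtime_s=600)
big = pick_resource(resources, data_mb=500, ref_runtime_s=86400)
print(small.name, big.name)
```

With these assumed inputs the short job stays on the workstation and the day-long job goes to the remote machine, which is the kind of decision people currently make by hand.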

Wide-area Parallel Processing

Consider a parameter space study: serial or parallel program, “big” or “small”, run many times on slightly different input data

Sometimes called “bag of tasks”

Great way to piss off your friends today

queue systems don’t like it when you submit 10,000 jobs

hard to pick the right resource, but picking the wrong one gets you into trouble

Bag of Tasks Tomorrow

A metacomputing system can do this problem well:

latency tolerant (by design)

can use remote resources, or multiple resources within one site
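
The bag-of-tasks pattern itself is simple enough to sketch on a single machine. The example below is a local stand-in only, assuming a hypothetical simulate function in place of the real program and a thread pool in place of remote resources; it does not use any Legion interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate(param: float) -> float:
    """Stand-in for one run of the real program on one input (hypothetical)."""
    return param * param


def run_bag(params, workers: int = 4) -> dict:
    """Run every task in the bag; tasks are independent, so order is irrelevant."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each submitted task back to the parameter that produced it.
        futures = {pool.submit(simulate, p): p for p in params}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


params = [0.5 * i for i in range(8)]  # slightly different inputs for each run
results = run_bag(params)
print(sorted(results.items()))
```

Because no task waits on another, it does not matter how long each result takes to come back, which is why this workload tolerates wide-area latency.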

Big Parallel Apps

Consider a big, explicit ocean model: MICOM, maybe NLOM

App is sufficiently latency-tolerant to use 2 machines for one run

bigger problem sizes tolerate more latency

across the room (few microseconds)

across the country (80 milliseconds)

Metacomputing provides the pieces to make this happen… more than cross-box MPI
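
A back-of-the-envelope way to see why bigger problems tolerate more latency: if each time step computes for some time and then waits for one boundary exchange, the fraction of time spent computing is compute / (compute + latency). The sketch below assumes the exchange is not overlapped with computation, takes "few microseconds" to mean 5 microseconds, and uses made-up per-step compute times.

```python
def efficiency(compute_s: float, latency_s: float, exchanges_per_step: int = 1) -> float:
    """Fraction of each time step spent computing rather than waiting."""
    return compute_s / (compute_s + exchanges_per_step * latency_s)


ACROSS_ROOM_S = 5e-6      # assumed value for "few microseconds"
ACROSS_COUNTRY_S = 80e-3  # 80 milliseconds, from the slide

# Per-step compute times are illustrative; larger grids mean more compute per step.
for compute_s in (0.01, 1.0, 10.0):
    room = efficiency(compute_s, ACROSS_ROOM_S)
    country = efficiency(compute_s, ACROSS_COUNTRY_S)
    print(f"compute {compute_s:5.2f} s/step: room {room:.3f}, country {country:.3f}")
```

Under these assumptions a step that computes for 10 ms is dominated by an 80 ms round trip, while a step that computes for a second or more loses under a tenth of its time.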

Meta Applications

Multi-component program

Pieces used to be separate programs

Programs have hardware affinities

Programs have big datasets which live at geographically-remote sites

Today’s “couplers” are expensive to build, expensive to run (human time)

Tomorrow’s couplers will hopefully be easier

The User’s View of Legion

Authentication (login)

legion_login <userid>

currently uses a password; other mechanisms can be easily added (SecurID)

the “login object” generates a certificate

this certificate identifies you in the future

ideally, one “login object” should be able to give you access to all your MSRC accounts

a goal of cross-domain Kerberos, but will it be accomplished?

Unified Console

Run-time output flows back to one or more “tty objects”

One or more windows can “watch” the tty

Dynamic connection and disconnection of both writers and watchers

Secured by usual Legion security methods

Benefits: flexibility, fault-tolerance, sharing, security

Location-Independent Objects/Files

Network-wide, transparent filesystem

Programs can read/write files regardless of execution location

minimal change: 1-2 lines of code per file

Benefits: transparent execution, sharing with others

Remote execution

non-Legion binaries, shell scripts, whatever

copies binary, data files

simplest way to do parameter space studies

legion_register_program <unix_path> <legion_path>

legion_run <legion_path> <parameters>

legion_run_multi -f spec <legion_path>
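
For a parameter space study, one plausible approach is to generate one legion_run command line per parameter set, following the `legion_run <legion_path> <parameters>` form given on the slide. The sketch below only builds the command strings and does not execute anything; the Legion path and the parameter values are hypothetical, and the format of the spec file for legion_run_multi is not shown in the slides, so it is not attempted here.

```python
import shlex


def legion_run_commands(legion_path: str, parameter_sets) -> list:
    """Build one 'legion_run <legion_path> <parameters>' line per parameter set."""
    return [
        shlex.join(["legion_run", legion_path, *map(str, params)])
        for params in parameter_sets
    ]


# Hypothetical registered program and inputs, for illustration only.
parameter_sets = [(0.1, "run_a"), (0.2, "run_b"), (0.3, "run_c")]
commands = legion_run_commands("/home/me/ocean_model", parameter_sets)

for line in commands:
    print(line)
```

shlex.join quotes any parameter containing spaces or shell metacharacters, so the generated lines can be pasted into a shell script as they are.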

Binary Management

More than just remote file access

Compile your code for each architecture, possibly using legion_run... don’t have to log in

Upload your binaries to Legion space

legion_register_binary legion_path Unix_path arch

register repeatedly for different architectures

Legion moves the right binary to wherever it’s needed (caching)

Parallel Computing

MPI and PVM: require relinking to our libraries

legion_mpi_run -n 4 my_program

Parallel programs in Java, C++ (MPLC extension), BFS (Fortran dataflow!)

We expose the runtime system to compiler writers and tool builders

Legion Status -- 1.3

Testbeds run cross-country continuously

Glues our testbed shared-nothing cluster together

Transparent files, remote execution, MPI, one style of security are here

Scheduling in beta, fault-tolerance a work in progress

Deployment: NPACI, DoD MSRCs, NASA, DoE

Summary

Metasystems are coming -- and can enhance your productivity

Legion is on its way to deployment

Legion’s costs (code changes; your time) balanced by its benefits

http://www.cs.virginia.edu/~legion/