SPRUCE: Special PRiority and Urgent Computing Environment


Page 1: SPRUCE (Special PRiority and Urgent Computing Environment)

Pete Beckman
Argonne National Laboratory
University of Chicago

http://spruce.teragrid.org/

Page 2: Modeling and Simulation Are a Critical Part of Decision Making

Urgent Computing - 2 University of Chicago Argonne National Lab

Page 3: I Need it Now!

• Applications with dynamic data and result deadlines are being deployed

• Late results are useless
    • Wildfire path prediction
    • Storm/Flood prediction
    • Influenza modeling

• Some jobs need priority access: a “Right-of-Way Token”

Page 4: Severe Weather Example

Example 1: Severe Weather Predictive Simulation from Real-Time Sensor Input

Source: Kelvin Droegemeier, Center for Analysis and Prediction of Storms (CAPS), University of Oklahoma. Collaboration with LEAD Science Gateway project.

Page 5: Medical Example

Example 2: Real-Time Neurosurgical Imaging Using Simulation (GENIUS project - HemeLB)

Source: Peter Coveney, GENIUS Project, University College London

Getting the Patient-Specific Data

Data is generated by MRA scanners at the National Hospital for Neurosurgery and Neurology.

[Figure labels: 2048² x 682 cubic voxels, res: 0.46875 mm; 512² pixels x 100 slices, res: 0.46875² mm x 0.8 mm; trilinear interpolation; our graphical-editing tool; HemeLB]

Page 6: Flood Modeling Example


Example 3: SURA Coastal Ocean Observing Program (SCOOP)

Source: Center for Computation and Technology, Louisiana State University

Page 7: How can we get cycles?

• Build supercomputers for the application
    • Pros: Resource is ALWAYS available
    • Cons: Incredibly costly (99% idle)
    • Example: Coast Guard rescue boats

• Share public infrastructure
    • Pros: Low cost
    • Cons: Requires complex system for authorization, resource management, and control
    • Examples: School buses for evacuation, cruise ships for temporary housing

Page 8: Introducing SPRUCE

• The Vision: Build cohesive infrastructure that can provide urgent computing cycles for emergencies

• Technical Challenges:
    • Provide high degree of reliability
    • Elevated priority mechanisms
    • Resource selection, data movement

• Social Challenges:
    • Who? When? What?
    • How will emergency use impact regular use?
    • Decision-making, workflow, and interpretation

Page 9: Existing “Digital Right-of-Way” Emergency Phone System

GETS priority is invoked “call-by-call”.

Calling cards are in widespread use and easily understood by the NS/EP User, simplifying GETS usage.

GETS is a ‘ubiquitous’ service in the Public Switched Telephone Network. If you can get a DIAL TONE, you can make a GETS call.

[Figure labels: GETS USER; ORGANIZATION]

Page 10: SPRUCE Architecture Overview (1/2): Right-of-Way Tokens

[Diagram labels: Event; Automated Trigger; Human Trigger; First Responder; Right-of-Way Token; SPRUCE Gateway / Web Services. Steps 1 and 2 are numbered in the figure.]

Page 11: SPRUCE Architecture Overview (2/2): Submitting Urgent Jobs

[Diagram labels: Conventional Job Submission Parameters; Urgent Computing Parameters; Urgent Computing Job Submission; User Team Authentication; Choose a Resource; SPRUCE Job Manager; Local Site Policies; Priority Job Queue; Supercomputer Resource. Steps 3, 4, and 5 are numbered in the figure.]

Page 12: Summary of Components

Central:
• Token & session management (admin, user, job manager) [SPRUCE Server (WS interface, Web portal)]

Distributed, Site-local:
• Priority queue and local policies [configuration & policy]
• Authorization & management for job submission and queuing [installed software]

Page 13: Internal Architecture

[Diagram labels]

Central SPRUCE Server: SPRUCE User Services || Validation Services; Axis 2 Web Service Stack; Tomcat Java Servlet Container; Apache Web Server; MySQL; JDBC; Mirror

Client Interfaces: Web Portal or Workflow Tools; PHP / Perl; AJAX; Client-Side Job Tools; Computing Resource: Job Manager & Scripts; Axis2; Java; PHP / Perl

Arrows: SOAP Request (two)

Other: Future work

Page 14: Site-Local Response Policies: How will Urgent Computing be treated?

• “Next-to-run” status for priority queue
    • Wait for running jobs to complete
• Force checkpoint of existing jobs; run urgent job
• Suspend current job in memory (kill -STOP); run urgent job
• Kill all jobs immediately; run urgent job
• Provide differentiated CPU accounting
    • “Jobs that can be killed because they maintain their own checkpoints will be charged 20% less”
• Other incentives
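A minimal sketch of how a site might encode the response policies and the differentiated accounting listed on this slide. The urgency-to-policy mapping, the "yellow" and "orange" level names, and all function names are illustrative assumptions, not part of SPRUCE; only the four policies and the 20% figure come from the slide.

```python
from enum import Enum


class ResponsePolicy(Enum):
    """The four site-local reactions to an urgent job named on the slide."""
    NEXT_TO_RUN = "wait for running jobs to complete"
    CHECKPOINT = "force checkpoint of existing jobs"
    SUSPEND = "suspend current job in memory (kill -STOP)"
    KILL = "kill all jobs immediately"


# Hypothetical mapping from urgency level to policy; each site would choose its own.
SITE_POLICY = {
    "yellow": ResponsePolicy.NEXT_TO_RUN,
    "orange": ResponsePolicy.SUSPEND,
    "red": ResponsePolicy.KILL,
}

KILLABLE_DISCOUNT = 0.20  # "charged 20% less", as quoted on the slide


def policy_for(urgency: str) -> ResponsePolicy:
    """Return the site's response; unknown levels fall back to the mildest policy."""
    return SITE_POLICY.get(urgency, ResponsePolicy.NEXT_TO_RUN)


def cpu_charge(cpu_hours: float, self_checkpointing: bool) -> float:
    """Differentiated accounting: discount jobs that can be killed safely."""
    rate = 1.0 - KILLABLE_DISCOUNT if self_checkpointing else 1.0
    return cpu_hours * rate
```

Keeping the mapping in a plain dictionary reflects the slide's point that the choice is a local site decision rather than something fixed centrally.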

Page 15: Emergency Preparedness Testing: “Warm Standby”

• For urgent computation, there is no time to port code
    • Applications must be in “warm standby”
    • Verification and validation runs test readiness periodically (Inca, MDS)
    • Reliability calculations
    • Only verified apps participate in urgent computing

• Grid-wide Information Catalog
    • Application was last tested & validated on <date>
    • Also provides key success/failure history logs
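The warm-standby idea can be sketched as a catalog record plus two checks. This is an illustration only: the `CatalogEntry` fields, the 30-day freshness window, and the function names are assumptions, not the schema of the actual information catalog.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one application on one resource."""
    application: str
    resource: str
    last_validated: date   # "last tested & validated on <date>"
    history: List[bool]    # success/failure log of past validation runs


def is_warm(entry: CatalogEntry, today: date,
            max_age: timedelta = timedelta(days=30)) -> bool:
    """Warm standby here means: the latest run passed and is recent enough."""
    recent = (today - entry.last_validated) <= max_age
    return recent and bool(entry.history) and entry.history[-1]


def reliability(entry: CatalogEntry) -> float:
    """Fraction of validation runs that succeeded (0.0 if never run)."""
    if not entry.history:
        return 0.0
    return sum(entry.history) / len(entry.history)
```

An application that fails `is_warm` would simply be excluded from the set of resources considered for an urgent run.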

Page 16: Resource Advisor

• Purpose
    • Given a set of distributed resources and a deadline, how does one select the “best” resource on which to run?
    • Analyze historical and live data to determine the likelihood of meeting a deadline.

• Generate a bound for the total turnaround time
    • Generate bounds for:
        • File Staging (FT)
        • Allocation time (e.g., queue delay) (AT)
        • Execution time (ET)
    • Overall turnaround time = FT + AT + ET
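A small sketch of the turnaround arithmetic, assuming all times are in minutes. The sum FT + AT + ET is from the slide; the empirical likelihood estimate (fraction of past runs that finished within the deadline) is one simple way to use historical data and is an assumption, not the advisor's actual method.

```python
from typing import Sequence


def turnaround_bound(ft: float, at: float, et: float) -> float:
    """Overall turnaround bound = file staging + allocation + execution."""
    return ft + at + et


def meets_deadline(ft: float, at: float, et: float, deadline: float) -> bool:
    """True if the summed bound fits within the deadline."""
    return turnaround_bound(ft, at, et) <= deadline


def deadline_likelihood(past_turnarounds: Sequence[float], deadline: float) -> float:
    """Empirical fraction of historical runs that finished within the deadline."""
    if not past_turnarounds:
        return 0.0
    on_time = sum(1 for t in past_turnarounds if t <= deadline)
    return on_time / len(past_turnarounds)
```

For example, bounds of 10, 30, and 45 minutes give a turnaround bound of 85 minutes, which fits a 90-minute deadline.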

Page 17: Selecting a Resource

Historical Data
• Network bandwidth
• Queue wait times
• Warm standby validation
• Local site policies

Live Data
• Network status
• Job/Queue data

Application-Specific Data
• Performance Model
• Verified Resources
• Data Repositories

[Diagram labels: Trigger; Urgent Severe Weather Job (Deadline: 90 Min); Advisor; “Best” HPC Resource]
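The selection step can be sketched as a filter followed by a minimum. The `Candidate` fields and the rule "pick the verified resource with the smallest turnaround bound that meets the deadline" are assumptions for illustration; the slide names the inputs but not the ranking rule. Times are in minutes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical per-resource bounds produced by the advisor's inputs."""
    name: str
    verified: bool        # passed warm-standby validation
    file_staging: float   # FT bound
    allocation: float     # AT bound (e.g., queue delay)
    execution: float      # ET bound

    @property
    def turnaround(self) -> float:
        return self.file_staging + self.allocation + self.execution


def select_resource(candidates: List[Candidate],
                    deadline: float) -> Optional[Candidate]:
    """Return the verified candidate with the smallest bound within the deadline."""
    feasible = [c for c in candidates
                if c.verified and c.turnaround <= deadline]
    return min(feasible, key=lambda c: c.turnaround, default=None)
```

Returning `None` when nothing qualifies leaves the fallback decision (relax the deadline, escalate urgency) to the caller.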

Page 18: Highly Available Resource Co-allocator (HARC)

• Some scenarios have definite early notice
    • SCOOP gets hurricane warnings a few hours to days in advance
    • No need for drastic steps like killing jobs

• HARC with SPRUCE for reservations
    • A reservation made via the portal will be associated with a token
    • Any user can use the reservation if added onto that active token
    • Can bypass local access control lists

Page 19: Bandwidth Tokens

Page 20: Deployment Status

• Deployed and Available on TeraGrid
    • UC/ANL
    • NCSA
    • SDSC
    • NCAR
    • Purdue
    • TACC

• Other sites
    • LSU
    • Virginia Tech
    • LONI

Page 21: Roadmap

• Ongoing work
    • SPRUCE integration with Condor
    • Policy mapping for resources and applications
    • SPRUCE integration with network bandwidth
    • WS-GRAM compatibility
    • Notification system with triggers on token use
    • INCA Q/A monitoring system for SPRUCE services

• Future Work
    • Automatic restart tokens
    • Aggregation, extension of tokens, ‘start_by’ deadlines
    • Encode (and probe for) local site policies
    • Warm standby integration
    • Data movement, network reservation, data storage
    • Failover & redundancy of SPRUCE server

Page 22: Imagine…

• A world-wide system for supporting urgent computing on supercomputers
• Slow, patient growth of large-scale urgent apps
• Expanding our notion of priority queuing, checkpoint/restart, CPU pricing, etc.
• A standardized set of web services for “request VM”, including all the complicated small bits
    • DHCP, VLANs, DNS, local storage, remote storage
• For Capability: 10 to 20 supercomputers available on demand
• For Capacity: Condor Flocks & Dynamic VMs provide availability of 250K “node instances”

Page 23: Partners

Page 24: Questions? Ready to Join?

[email protected]

http://spruce.teragrid.org/

Page 25: Screen Shots: Running Urgent Jobs

Page 26: Direct SPRUCE Job Submission (No Grid Middleware)

    # spruce_sub urgency=red spruce_test.pbs
    No Valid Token found for user = beckman, aborting job submission

    <validate token at SPRUCE gateway>

    # spruce_sub urgency=red spruce_test.pbs
    240559

    # qstat
    JobId  Name       User    S Queue
    ------ ---------- ------- - ------
    240552 Cylinder-1 gustav  Q dque
    240556 STDIN      lgrinb  Q dque
    240559 spruce-job beckman R spruce
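The token check implied by this session can be sketched as follows. This is a hypothetical model, not SPRUCE's implementation: the `Token` fields, the urgency ordering, and the function names are assumptions. Only the rejection message text mirrors the session output shown on the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set

# Assumed ordering of urgency levels, lowest to highest.
URGENCY_ORDER = ["yellow", "orange", "red"]


@dataclass
class Token:
    """Hypothetical right-of-way token with a lifetime and a user list."""
    expires: datetime
    max_urgency: str
    users: Set[str] = field(default_factory=set)


def authorize(user: str, urgency: str, tokens: List[Token], now: datetime) -> bool:
    """True if some unexpired token lists the user and permits this urgency."""
    if urgency not in URGENCY_ORDER:
        return False
    level = URGENCY_ORDER.index(urgency)
    return any(
        user in t.users
        and now < t.expires
        and level <= URGENCY_ORDER.index(t.max_urgency)
        for t in tokens
    )


def submit(user: str, urgency: str, tokens: List[Token], now: datetime) -> str:
    """Gate the submission on the token check, as in the session above."""
    if not authorize(user, urgency, tokens, now):
        return f"No Valid Token found for user = {user}, aborting job submission"
    return "accepted"
```

In this model, validating a token at the gateway corresponds to adding a `Token` that lists the user, after which the same submission is accepted.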

Page 27: SPRUCE Job Submission via Globus

    # grid-proxy-init
    Enter GRID pass phrase for this identity: *************
    Your proxy is valid until: Sat Nov 18 03:21:30 2007

    # cat globus_test.rsl
    <…>
    (resourceManagerContact = tg-grid1.uc.teragrid.org:2120/jobmanager-spruce)
    (executable = /home/beckman/spruce/mpihello)
    <…>
    (urgency = red)
    <…>

    # globusrun -o -f globus_test.rsl

Page 28: Screen Shots: LEAD Demo Shots

[Pages 29-37: nine screenshots, numbered 1 to 9; images not preserved in this transcript.]

Page 38: Screen Shots: Managing Tokens

[Pages 39-47: nine screenshots, numbered 1 to 9; images not preserved in this transcript.]

Page 48: Screen Shots: Admin Views

[Pages 49-52: four screenshots, numbered 1 to 4; images not preserved in this transcript.]

Page 53: Screen Shots: Resource Advisor

[Pages 54-56: three screenshots, numbered 1 to 3; images not preserved in this transcript.]

Page 57: Screen Shots: SPRUCE Bandwidth

[Pages 58-61: four screenshots, numbered 1 to 4; images not preserved in this transcript.]