Review report, Vote on APIs, Quarterly report, and SW release
Al Geist
June 5-6, 2003
Chicago, IL


Coordinator: Al Geist

Participating Organizations

ORNL, ANL, LBNL, PNNL

PSC, SDSC, IBM

SNL, LANL, Ames, NCSA

Cray, Intel, Unlimited Scale

Participating Organizations

External reviewers want to see more vendors involved

Have begun working with Don Mason and John Lawson to set up a presentation to a vendor forum.

Will need your participation when logistics are known

IBM, Cray, Intel, Unlimited Scale

Scalable Systems Software

Participating Organizations

ORNL, ANL, LBNL, PNNL

NCSA, PSC, SDSC

SNL, LANL, Ames

Problem

• Computer centers use incompatible, ad hoc set of systems tools
• Present tools are not designed to scale to multi-Teraflop systems

Goals

• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers

Impact

• Reduced facility mgmt costs.
• More effective use of machines by scientific applications.

Resource Management

Accounting & user mgmt

System Build & Configure

Job management

System Monitoring

To learn more visit www.scidac.org/ScalableSystems

Scalable Systems Software Center, February 24-25, Chicago, IL

Review of Last Meeting

Details in Main project notebook

Progress Reports at Feb. mtg

Al Geist – preparation for external review, SciDAC PI meeting, posters, and demos

Working Group Leaders –
• What areas their working group is addressing
• Progress report on what their group has done
• Present problems being addressed
• Next steps for the group
• Discussion items for the larger group to consider

Discussion of Prototype Components
Prep for external review demo

Slides can be found in Main Notebook

Consensus and Voting:

None at last meeting.

Something we need to start doing again.

Scalable Systems Software Center

February-June

Progress Since Last Meeting

SciDAC PI mtg – all 50 projects

March 10-11, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode

20 minute talk – presented by Al
Scalable Systems, CCA, PERC, SDM

Poster Presentation

External SciDAC Review mtg

March 12-13, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode, Paul Hargrove, Narayan Desai, Mike Showerman. (Rusty)

Four ISIC Projects were reviewed separately – Scalable Systems, CCA, PERC, SDM

External review panel (9 members): Bob Lucas, Jim McGraw, Jose Munoz, Lauren Smith, Richard Mount, Ricky Kendall, Rod Oldehoeft, Tony Mezzacappa, and John Grosh

Day 1 – We had 1 ¾ hours to present project
Day 2 – We got grilled by panel for 1½ hrs

External Review mtg Agenda

Wednesday, March 12

7:45 Welcome, charge to reviewers
8:15 Plenary session for Common Component Architecture ISIC
10:00 Break
10:15 Plenary session for Scalable Systems Software ISIC
      Al Geist gives 1 hr project overview, vision, goals
      Last 45 minutes team gives demos, answers questions
12:00 Reviewer caucus
12:15 Lunch
1:15 Plenary session for Scientific Data Management ISIC
3:00 Break
3:15 Plenary session for Performance Engineering ISIC
5:00 Reviewer caucus
5:30 Adjourn

External Review Demo

[Diagram: Working Components and Interfaces (bold)]

Components shown: Meta Services (Meta Scheduler, Meta Monitor, Meta Manager), Grid Interfaces, Accounting, Event Manager, Service Directory, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint/Restart, Validation & Testing, Hardware Infrastructure Manager

Components are connected through standard XML interfaces, with authentication and communication layers.

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite

External Review mtg Agenda Day 2

Thursday, March 13

8:00 Meetings between reviewers and ISIC members
     A. Common Component Architecture
     B. Scalable Systems Software (Jim McGraw)
9:45 Break
10:00 Meetings between reviewers and ISIC members
     C. Scientific Data Management
     D. Performance Engineering
11:45 Reviewer Caucus/End of ISIC Reviews
12:15 Lunch (McGraw gives initial assessment)

Team brainstorms on the response to the initial comments. These were sent to McGraw that same day.

External Review Initial Comments

Response on top two issues: 1. Lack of large-scale testbed for scalable systems

   Mike Showerman of NCSA says they will have a 900 processor system by late summer that Scalable Systems software could be tested on. He also said there are plans to get an additional 1300 node system. CPlant has also been suggested as a possible large scale (~1200 processor) test platform.

2. Get more vendors involved and more "buy-in"

   I will redouble my efforts to get SGI involved in Scalable Systems again. HP has been a tough nut to crack; both PSC and PNNL have tried to get them to engage. I'll see if PSC and PNNL are willing to try again.
   By late summer we will have a beta release of the suite that I can use to demonstrate to vendors our progress and the advantages of going the scalable systems path.

Official External Review Report (arrived in May 2003)

• Organizationally the project has developed effective working units
• The project appears to be on schedule for technical issues
• It has made several noteworthy accomplishments

Recommendations:
The two greatest obstacles to the success of this project are the availability of an adequate testbed for proving scalability of the interface design and the willingness of vendors to adopt the design for future systems

Secondary considerations

• Investigate relationship with CCA
• Investigate file system plan
• Investigate security plan
• Importance of fault tolerance at smaller cluster sizes
• Develop test workloads

Five Project Notebooks filling up

A main notebook for general information

And individual notebooks for each working group

• Over 270 total pages – few added since last meeting

• Add Telecon meeting notes even if short

• Have had several web server problems this quarter

Get to all notebooks through main web site www.scidac.org/ScalableSystems

Click on side bar or at “project notebooks” at bottom of page

Bi-Weekly Working Group Telecons
Have been sparse since March review

Resource management, scheduling, and accounting

Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”

Validation and Testing (hasn’t met since last year)

Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157

Process management, system monitoring, and checkpointing

Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910

Node build, configuration, and information service

Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)

Scalable Systems Software Center

June 5-6, 2003

This Meeting

Major Topics this Meeting

MICS request for Highlights – Fred sent out a call for 2 page highlights due to MICS by June 12. Has anyone responded? I sent in our 2 pager

Response to Reviewers Report – need feedback from the team on our official response to the points in the report

Quarterly Report Due – would like to get one to Fred by end of June. Will need text from WG leaders.

Formal API presentations and voting - it is that time in the project when we should be settling on some APIs.

SC2003 Tutorial - proposal submitted at Fred’s request. Have a software suite released before SC2003

Agenda – June 5

8:30 Al Geist – Project Status. Qtr report coming up and External review report
9:00 Matt Sottile – Using Scalable Systems API
Working Group Reports
9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own - walk to cafeteria)
1:00 Paul Hargrove – Process Management + Rusty slides
1:30 Craig Stefan – Warehouse Monitoring framework
2:00 Narayan Desai – Node Build, Configure
     Stephen Scott – OSCAR release with SSS inside
3:00 Break
3:30 Presentation of formal APIs for discussion
5:00 Rusty, Scott, Narayan, Paul?
5:30 Adjourn
Working groups may wish to prepare material for voting Friday

Agenda – June 6

8:30 Discussion, proposals, straw votes
     Discussion of review report
     API proposals for envelope
10:30 Break
11:00 Al Geist – Summary Qtr Report. Next meeting date:
12:00 Meeting ends

Meeting notes

Matt Sottile – bproc (bstat_sss) software integrated with cluster status component
• Good (was able to do it in a day), bad (shouldn’t take 8 hours), ugly (Python)
• Example with distribution didn’t help much.
• XML isn’t well documented
• But it is a prototype distribution, so some of these issues are expected
• Major gripes: had to write socket code and XML parsing and creation code. These should be APIs
• He then talks about Linux TCP being a hack
• XML parsing – the schema and associated parser are intimately related
• Noted that code had some constructs that could be made more robust

CCA thoughts on relation to our project
• His expertise is language interoperability and runtime frameworks
• Law of least surprises. Consistency is good
• Insulate developers from the support structure
• Components: the wheel everyone continues to reinvent
• But in SSS there aren’t components – just XML and wire protocol
• CCA provides: SIDL, standard interfaces to runtimes – CCAFFEINE, CCAT, Dune, …
• Suggests: could try to leverage CCA messaging layer, define interfaces in SIDL, build services that conform to SIDL
• CCA provides no security
• Concentrate on interfaces and problem of mapping concrete services into the interface space of SSS

Conclusion: Clean up APIs to minimize possibilities for version skew. Too late to adopt CCA model
Overall things worked – a good accomplishment
Showed demo

Meeting notes

Scott Jackson – RM wg report
• Progress – SSS front end created for Qbank
• Soon: Release v1.0 – Open PBS, Maui, and Qbank all with SSS XML front end.
• Created Job Object specification v2.0
• Created SSSRMAP v 2.0 – in notebook
• Scheduler progress: 40% of clients now using SSSRMAP, supports dynamic reservations to support growing and shrinking MPI jobs
• Security – support for a user specified keyfile
• Fault tolerance – implemented a fallback server
• Ease of use – initial web-GUI developed
• Queue Manager progress – updated service directory and event manager interfaces
• Accounting and allocation manager progress – GOLD
• All functionality of Qbank plus support for deposits, support for hierarchical accounts, support for refunds, guaranteed quotes, negotiation of options
• Added role-based access control, authentication, and encryption
• Got PNL OK to open source as BSD, sent email to Fred for DOE OK
• Will talk about SSSRMAP v2 details this afternoon, in particular interfaces to other working group components.

Meeting notes

Will McLendon – Validation and testing WG update
• Strategies for distributed runtime system testing – users expect high quality
• ESP benchmark – out of NERSC, used in procurement to predict the effectiveness of a system before it is purchased. Could be used to test the SSS suite
• Consider putting ESP on the SSS testbed(s)
• APItest – most of the work this quarter is going on here.
• Recoded in Python for portability (C++ version had portability problems)
• Integrated into SSSlib as part of the distribution
• Tests well under development for ssslib components
• Status slide shows working, prototype, and planned features
• Black box testing – does component support the API
• White box testing – coverage tests, internal states of component, unreachable states
• Encoding XML inside XML is a problem
• Ran real demos of APItest running on Chiba City
• MySQL database support – used to store raw test results
• Work still to do – see status slide

Meeting notes

Paul Hargrove – used my laptop for presentation – see slides
• Checkpoint/restart progress is stalled because person has been pulled off our project by Bill McCurdy to work on NERSC projects.

Craig Steffan – Warehouse Monitoring Software Infrastructure
• Describes the old way cluster monitor worked and scalability issues with it
• Presents new design – each node is a peer, each can be root of subtree
• They can be grouped into “information storehouses” w/ multiple sources and sinks
• Showed how it can be used to monitor multiple clusters in a compute center
• Information storehouse infrastructure is done.
• Sources and Sinks – next step will be to write simple ones, then more complex
• Lots of questions about the design. Good answers from Craig
• Only update changing information
• In next 6 months - Self balancing systems by tuning update intervals, and message passing to request information through the tree
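The "only update changing information" point can be illustrated with a small sketch. This is not the Warehouse code; the names changed_fields and DeltaSource are hypothetical, and readings are assumed to be flat dictionaries of metric name to value. Removed metrics are not reported in this sketch.

```python
_MISSING = object()  # sentinel so that a stored value of None still compares correctly


def changed_fields(previous, current):
    """Return only the entries of `current` that are new or differ from `previous`."""
    return {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }


class DeltaSource:
    """Hypothetical source that remembers the last reading it sent
    and forwards only the metrics that changed since then."""

    def __init__(self):
        self._last = {}

    def update(self, reading):
        delta = changed_fields(self._last, reading)
        self._last = dict(reading)  # copy so later caller mutations do not leak in
        return delta
```

A sink receiving these deltas would merge them into its stored copy of the node state, so an idle node sends an empty update.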

Meeting notes

Narayan Desai – BCWG report
• All APIs changed to restriction syntax – draft spec
• Service directory – new schema and new implementation
• Event manager – same
• SSSlib – more wire protocol modules – SSL, SSSRMAP
• OSX port in progress
• Build and configuration now has diagnostic services
• Hardware infrastructure issues discussed – what does system look like right now?
• Open issues
  • specification formats – what tests does it need to pass
  • release formats – see OSCAR slides
  • XML interface formats
  • multiple implementations

Thomas Naughton – SSS deployment using OSCAR
• How do users download and install SSS suite? Propose to leverage OSCAR framework
• OSCAR core – SIS, C3, ODA, Env-Switcher
• OSCAR package facility – RPMs and other package classes
• OSCAR package loader
• Seems to be consensus of group to do this for SC2003

Meeting notes

Rusty Proposal – an API for the Process Management Component
• He says the material is not quite in the form needed to vote on, but here is the process we should follow to vote in standard APIs
• Voting should be on a document that has
  • descriptions
  • examples, both simple and complex
  • details of XML schema
• See his slides for details of his process manager interface proposal
• Much discussion.

Scott Jackson – SSSRMAP v2 proposal
• Have taken an object oriented approach to jobs and attributes
• Goes over basic examples in proposal (found in RM notebook)
• Discussion of the differences between RM Schema and BC Schema
• Part of the difference is the incorporation of security
• Another part is functional vs object oriented
• Discussion of outer (envelope, signature, body) framing and put in SSSlib (vote)
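As a rough illustration of the outer framing idea (an envelope carrying a signature and a body), here is a minimal sketch. It is not the SSSRMAP format: the element names Envelope, Signature, and Body, the functions wrap and unwrap, and the use of an HMAC-SHA256 over the serialized body are all assumptions made for the example. A real design would also need XML canonicalization before signing.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET


def _digest(body_el, key):
    # Sign the serialized <Body> element (assumed scheme, no canonicalization).
    return hmac.new(key, ET.tostring(body_el), hashlib.sha256).hexdigest()


def wrap(body_xml, key):
    """Place a request inside Envelope/Body and attach a keyed signature."""
    envelope = ET.Element("Envelope")
    signature = ET.SubElement(envelope, "Signature")
    body_el = ET.SubElement(envelope, "Body")
    body_el.append(ET.fromstring(body_xml))
    signature.text = _digest(body_el, key)
    return ET.tostring(envelope, encoding="unicode")


def unwrap(envelope_xml, key):
    """Verify the signature and return the request element from the body."""
    envelope = ET.fromstring(envelope_xml)
    signature = envelope.find("Signature")
    body_el = envelope.find("Body")
    if signature is None or body_el is None or len(body_el) == 0:
        raise ValueError("malformed envelope")
    if not hmac.compare_digest(_digest(body_el, key), signature.text or ""):
        raise ValueError("signature mismatch")
    return body_el[0]
```

Keeping the framing in one shared library means each component only builds and reads the body, which is the part that differs between working groups.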

Meeting notes Day 2

Al Geist – action items

1. Need Working group leaders to send me a couple pages for the Qtr rpt: Status and Progress from Feb-June

2. Any comments on points in the external reviewers report. Paragraph or two is fine.

Meeting notes

Narayan Desai – Restriction syntax proposal
• Goes over basic command syntax where an attribute can be “*” wildcarded
• Goes over complex command syntax
• Matching semantics – especially for wildcards
• Benefits of this approach – compact, powerful, simple syntax, validatable, data ownership is explicit
• Uses MySQL on the backend

• This syntax has Conjunctive Normal Form
• Discussion that negation needs to be added before this is true
• What about regular expression support?
• More discussion on how to do various things like “join” and “union”
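A toy version of wildcard matching may help make the semantics concrete. This is only a sketch under assumptions: the proposal itself is XML based, while here plain dictionaries stand in for both the restriction and the stored records, and the names matches and select are hypothetical. Every attribute in the restriction must match (a conjunction), “*” accepts any value as long as the attribute exists, and there is no negation, which mirrors the gap raised in the discussion.

```python
WILDCARD = "*"


def matches(restriction, record):
    """True if every restricted attribute is present in the record and
    either equals the requested value or is wildcarded."""
    for attr, wanted in restriction.items():
        if attr not in record:
            return False
        if wanted != WILDCARD and record[attr] != wanted:
            return False
    return True


def select(restriction, records):
    """Return the records that satisfy the restriction, in their original order."""
    return [record for record in records if matches(restriction, record)]


# Example data (invented for illustration only)
nodes = [
    {"name": "node1", "state": "up", "arch": "x86"},
    {"name": "node2", "state": "down", "arch": "x86"},
    {"name": "node3", "state": "up"},
]
```

With this data, select({"state": "up", "arch": "*"}, nodes) keeps only node1, because node3 has no arch attribute to match the wildcard against.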

Discussion of the Communication Infrastructure Spec Draft (hardcopy handed out)
• We should be able to hardwire components together.
• Existence of static file to define where things are – may just have service directory
• Unix domain socket protocol for SMP servers
• Vote – accept the spec pending: Yes 15, No 0, Abstaining 0

Meeting notes

Paul – Discusses the idea of hiding the socket code in a library
Matt says he would be happy to contribute such a server.
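One way to picture "hiding the socket code in a library" is a pair of helpers that send and receive whole messages, so component authors never touch recv loops. The sketch below is an assumption, not one of the SSSlib wire protocols: it uses a 4-byte big-endian length prefix, and the names send_message and recv_message are hypothetical.

```python
import socket
import struct


def send_message(sock, payload):
    """Send one text message, prefixed with its length in bytes."""
    data = payload.encode("utf-8")
    sock.sendall(struct.pack("!I", len(data)) + data)


def _recv_exact(sock, count):
    """Read exactly `count` bytes, looping because recv may return fewer."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed connection mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one length-prefixed text message."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return _recv_exact(sock, length).decode("utf-8")
```

A component would then call send_message(sock, xml_text) and recv_message(sock) and leave framing details to the library.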

Discussion of scalability of the event manager – not a problem because the number of meatballs does not increase with system size.
Question about the ordering of event notifications

Scott – Lively discussion of the two XML variants
What are the strengths and weaknesses of both?

Agreement for having common error objects with 3 digit codes and messages
Message is human readable string. Two special ones: 000 success, 999 unknown
Straw vote: Yes 15, No 1, Abs 0
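A minimal sketch of such an error object, assuming only what was agreed: a three-digit code, a human readable message, 000 for success and 999 for unknown. The class name ErrorObject and the helper make_error are hypothetical, and no other code values are defined here because none were agreed.

```python
from dataclasses import dataclass

SUCCESS = "000"  # agreed special code: success
UNKNOWN = "999"  # agreed special code: unknown error


@dataclass(frozen=True)
class ErrorObject:
    code: str     # three digits, kept as a string to preserve leading zeros
    message: str  # human readable description

    def __post_init__(self):
        if len(self.code) != 3 or not self.code.isdigit():
            raise ValueError(f"status code must be exactly three digits: {self.code!r}")

    @property
    def ok(self):
        return self.code == SUCCESS


def make_error(code, message=""):
    """Normalise an int or str code to three digits; malformed codes become 999."""
    try:
        return ErrorObject(f"{int(code):03d}", message)
    except (TypeError, ValueError):
        return ErrorObject(UNKNOWN, message or "unknown error")
```

Storing the code as a string keeps "000" intact when the object is written into XML.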

Add “supported scheme version” to Service directory. Vote: Yes 15, No 0, Abs 0

Next meeting September 9-10 in DC so Fred can attend?