Review report, Vote on APIs, Quarterly report, and SW release. Al Geist, June 5-6, 2003, Chicago, IL.
Coordinator: Al Geist
Participating Organizations
ORNL, ANL, LBNL, PNNL
PSC, SDSC, IBM
SNL, LANL, Ames, NCSA
Cray, Intel, Unlimited Scale
Participating Organizations
External reviewers want to see more vendors involved
Have begun working with Don Mason and John Lawson to set up a presentation to a vendor forum.
Will need your participation when logistics are known
IBM, Cray, Intel, Unlimited Scale
Scalable Systems Software
Participating Organizations
ORNL, ANL, LBNL, PNNL
NCSA, PSC, SDSC
SNL, LANL, Ames
Problem
• Computer centers use incompatible, ad hoc set of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers
Impact
• Reduced facility mgmt costs
• More effective use of machines by scientific applications
Resource Management
Accounting & user mgmt
System Build & Configure
Job management
System Monitoring
To learn more visit www.scidac.org/ScalableSystems
Scalable Systems Software Center, February 24-25, Chicago, IL
Review of Last Meeting
Details in Main project notebook
Progress Reports at Feb. mtg
Al Geist – preparation for external review, SciDAC PI meeting, posters, and demos
Working Group Leaders –
• What areas their working group is addressing
• Progress report on what their group has done
• Present problems being addressed
• Next steps for the group
• Discussion items for the larger group to consider
Discussion of Prototype Components
Prep for external review demo
Slides can be found in Main Notebook
Consensus and Voting:
None at last meeting.
Something we need to start doing again.
Scalable Systems Software Center
February-June
Progress Since Last Meeting
SciDAC PI mtg – all 50 projects
March 10-11, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode
20 minute talk – presented by Al
Scalable Systems, CCA, PERC, SDM
Poster Presentation
External SciDAC Review mtg
March 12-13, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode, Paul Hargrove, Narayan Desai, Mike Showerman. (Rusty)
Four ISIC Projects were reviewed separately – Scalable Systems, CCA, PERC, SDM
External review panel (9 members): Bob Lucas, Jim McGraw, Jose Munoz, Lauren Smith, Richard Mount, Ricky Kendall, Rod Oldehoeft, Tony Mezzacappa, and John Grosh
Day 1 – We had 1¾ hours to present the project
Day 2 – We got grilled by the panel for 1½ hrs
External Review mtg Agenda
Wednesday, March 12
7:45 Welcome, charge to reviewers
8:15 Plenary session for Common Component Architecture ISIC
10:00 Break
10:15 Plenary session for Scalable Systems Software ISIC
• Al Geist gives 1 hr project overview, vision, goals
• Last 45 minutes team gives demos, answers questions
12:00 Reviewer caucus
12:15 Lunch
1:15 Plenary session for Scientific Data Management ISIC
3:00 Break
3:15 Plenary session for Performance Engineering ISIC
5:00 Reviewer caucus
5:30 Adjourn
External Review Demo
[Figure: Scalable Systems Software Suite architecture. Components shown: Grid Interfaces; Meta Services (Meta Scheduler, Meta Monitor, Meta Manager); Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint/Restart; Validation & Testing; Hardware Infrastructure Manager. Components are connected by standard XML interfaces over a common authentication and communication layer. Working components and interfaces appear in bold in the original figure.]
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
External Review mtg Agenda Day 2
Thursday, March 13
8:00 Meetings between reviewers and ISIC members
A. Common Component Architecture
B. Scalable Systems Software (Jim McGraw)
9:45 Break
10:00 Meetings between reviewers and ISIC members
C. Scientific Data Management
D. Performance Engineering
11:45 Reviewer Caucus/End of ISIC Reviews
12:15 Lunch (McGraw gives initial assessment)
Team brainstorms on the response to the initial comments
These were sent to McGraw that same day.
External Review Initial Comments
Response on top two issues:
1. Lack of large-scale testbed for scalable systems
Mike Showerman of NCSA says they will have a 900 processor system by late summer that Scalable Systems software could be tested on. He also said there are plans to get an additional 1300 node system. CPlant has also been suggested as a possible large-scale (~1200 processor) test platform.
2. Get more vendors involved and more "buy-in"
I will redouble my efforts to get SGI involved in Scalable Systems again. HP has been a tough nut to crack; both PSC and PNNL have tried to get them to engage. I'll see if PSC and PNNL are willing to try again. By late summer we will have a beta release of the suite that I can use to demonstrate to vendors our progress and the advantages of going the scalable systems path.
Official External Review Report (arrived in May 2003)
• Organizationally the project has developed effective working units
• The project appears to be on schedule for technical issues
• It has made several noteworthy accomplishments
Recommendations:
The two greatest obstacles to the success of this project are the availability of an adequate testbed for proving scalability of the interface design and the willingness of vendors to adopt the design for future systems
Secondary considerations
• Investigate relationship with CCA
• Investigate file system plan
• Investigate security plan
• Importance of fault tolerance at smaller cluster sizes
• Develop test workloads
Five Project Notebooks filling up
A main notebook for general information
And individual notebooks for each working group
• Over 270 total pages – few added since last meeting
• Add Telecon meeting notes even if short
• Have had several web server problems this quarter
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on side bar or at “project notebooks” at bottom of page
Bi-Weekly Working Group Telecons
Have been sparse since March review
Resource management, scheduling, and accounting
Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”
Validation and Testing (hasn’t met since last year)
Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157
Process management, system monitoring, and checkpointing
Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910
Node build, configuration, and information service
Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)
Major Topics this Meeting
MICS request for Highlights – Fred sent out a call for 2 page highlights due to MICS by June 12. Has anyone responded? I sent in our 2 pager
Response to Reviewers Report – need feedback from the team on our official response to the points in the report
Quarterly Report Due – would like to get one to Fred by end of June. Will need text from WG leaders.
Formal API presentations and voting - it is that time in the project when we should be settling on some APIs.
SC2003 Tutorial - proposal submitted at Fred’s request. Goal: have a software suite released before SC2003
Agenda – June 5
8:30 Al Geist – Project Status. Qtr report coming up and External review report
9:00 Matt Sottile – Using Scalable Systems API
Working Group Reports
9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own - walk to cafeteria)
1:00 Paul Hargrove – Process Management + Rusty slides
1:30 Craig Stefan – Warehouse Monitoring framework
2:00 Narayan Desai – Node Build, Configure
Stephen Scott – OSCAR release with SSS inside
3:00 Break
3:30 Presentation of formal APIs for discussion
5:00 Rusty, Scott, Narayan, Paul?
5:30 Adjourn
Working groups may wish to prepare material for voting Friday
Agenda – June 6
8:30 Discussion, proposals, straw votes
• Discussion of review report
• API proposals for envelope
10:30 Break
11:00 Al Geist – Summary. Qtr Report. Next meeting date:
12:00 meeting ends
Meeting notes
Matt Sottile – bproc (bstat_sss) software integrated with cluster status component
• Good (was able to do it in a day), bad (shouldn’t take 8 hours), ugly (Python)
• Example with distribution didn’t help much.
• XML isn’t well documented
• But it is a prototype distribution, so some of these issues are expected
• Major gripes: had to write code for sockets and for XML parsing and creation
• These should be APIs – he then talks about Linux TCP being a hack
• XML parsing – the schema and associated parser are intimately related
• Noted that code had some constructs that could be made more robust
CCA thoughts on relation to our project
• His expertise is language interoperability and runtime frameworks
• Law of least surprises. Consistency is good
• Insulate developers from the support structure
• Components: the wheel everyone continues to reinvent
• But in SSS there aren’t components – just XML and wire protocol
• CCA provides: SIDL, standard interfaces to runtimes – CCAFFEINE, CCAT, Dune, …
• Suggests: could try to leverage CCA messaging layer, define interfaces in SIDL, build services that conform to SIDL. CCA provides no security
• Concentrate on interfaces and the problem of mapping concrete services into the interface space of SSS
• Conclusion: clean up APIs to minimize possibilities for version skew. Too late to adopt CCA model
• Overall things worked – a good accomplishment
• Showed demo
Meeting notes
Scott Jackson – RM wg report
• Progress – SSS front end created for Qbank. Release v1.0 coming soon: Open PBS, Maui, and Qbank, all with SSS XML front end.
• Created Job Object specification v2.0
• Created SSSRMAP v2.0 – in notebook
• Scheduler progress: 40% of clients now using SSSRMAP; supports dynamic reservations to support growing and shrinking MPI jobs
• Security – support for a user-specified keyfile
• Fault tolerance – implemented a fallback server
• Ease of use – initial web GUI developed
• Queue Manager progress – updated service directory and event manager interfaces
• Accounting and allocation manager progress – GOLD
• All functionality of Qbank plus support for deposits, hierarchical accounts, refunds, guaranteed quotes, and negotiation of options
• Added role-based access control, authentication, and encryption
• Got PNL OK to open source as BSD, sent email to Fred for DOE OK
• Will talk about SSSRMAP v2 details this afternoon, in particular interfaces to other working group components.
Meeting notes
Will McLendon – Validation and testing WG update
• Strategies for distributed runtime system testing – users expect high quality
• ESP benchmark – out of NERSC, used in procurement to predict the effectiveness of a system before it is purchased. Could be used to test the SSS suite
• Consider putting ESP on the SSS testbed(s)
• APItest – most of the work this quarter is going on here.
• Recoded in Python for portability (C++ version had portability problems)
• Integrated into SSSlib as part of the distribution
• Tests well under development for ssslib components
• Status slide shows working, prototype, and planned features
• Black box testing – does component support the API
• White box testing – coverage tests, internal states of component, unreachable states
• Encoding XML inside XML is a problem
• Ran real demos of APItest running on Chiba City
• MySQL database support – used to store raw test results
• Work still to do – see status slide
Meeting notes
Paul Hargrove – used my laptop for presentation – see slides
• Checkpoint/restart progress is stalled because the person has been pulled off our project by Bill McCurdy to work on NERSC projects.
Craig Steffan – Warehouse Monitoring Software Infrastructure
• Describes the old way cluster monitor worked and scalability issues with it
• Presents new design – each node is a peer; each can be root of a subtree
• They can be grouped into “information storehouses” w/ multiple sources and sinks
• Showed how it can be used to monitor multiple clusters in a compute center
• Information storehouse infrastructure is done.
• Sources and Sinks – next step will be to write simple ones, then more complex
• Lots of questions about the design. Good answers from Craig
• Only update changing information
• In next 6 months – self-balancing systems by tuning update intervals, and message passing to request information through the tree
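The "only update changing information" point can be illustrated with a small sketch. This is not Warehouse code: the class name `DeltaSource` and the dictionary-based sample format are assumptions made purely for illustration. The idea shown is that a source remembers the last value it reported for each metric and forwards only the entries that differ.

```python
# Hypothetical sketch of delta updates; not the Warehouse implementation.
_MISSING = object()  # sentinel for "never reported"


class DeltaSource:
    """Remembers the last reported value of each metric for one node."""

    def __init__(self):
        self._last = {}

    def delta(self, sample):
        """Return only the metrics whose value changed since the last call."""
        changed = {
            name: value
            for name, value in sample.items()
            if self._last.get(name, _MISSING) != value
        }
        self._last.update(changed)
        return changed


source = DeltaSource()
first = source.delta({"load": 0.5, "mem_mb": 100})   # everything is new
second = source.delta({"load": 0.5, "mem_mb": 120})  # only mem_mb changed
third = source.delta({"load": 0.5, "mem_mb": 120})   # nothing changed
```

With this approach the volume of traffic up the tree depends on how often values change rather than on how often nodes are sampled.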
Meeting notes
Narayan Desai – BCWG report
• All APIs changed to restriction syntax – draft spec
• Service directory – new schema and new implementation
• Event manager – same
• SSSlib – more wire protocol modules – SSL, SSSRMAP
• OSX port in progress
• Build and configuration now has diagnostic services
• Hardware infrastructure issues discussed – what does the system look like right now?
Open issues
• specification formats – what tests does it need to pass
• release formats – see OSCAR slides
• XML interface formats
• multiple implementations
Thomas Naughton – SSS deployment using OSCAR
• How do users download and install the SSS suite? Proposes leveraging the OSCAR framework
• OSCAR core – SIS, C3, ODA, Env-Switcher
• OSCAR package facility – RPMs and other package classes
• OSCAR package loader
• Seems to be consensus of group to do this for SC2003
Meeting notes
Rusty Proposal – an API for the Process Management Component
• He says the material is not quite in the form needed to vote on, but here is the process we should follow to vote in standard APIs
• Voting should be on a document that has
  descriptions
  examples both simple and complex
  details of XML schema
• See his slides for details of his process manager interface proposal
• Much discussion.
Scott Jackson – SSSRMAP v2 proposal
• Have taken an object-oriented approach to jobs and attributes
• Goes over basic examples in proposal (found in RM notebook)
• Discussion of the differences between RM Schema and BC Schema
• Part of the difference is the incorporation of security
• Another part is functional vs object oriented
• Discussion of outer (envelope, signature, body) framing and putting it in SSSlib (vote)
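To make the envelope/signature/body framing concrete, here is a minimal sketch. It is not the SSSRMAP wire format: the element names `Envelope`, `Signature`, and `Body`, the use of HMAC-SHA256 with a shared key, and the naive "serialize the Body element" canonicalization are all assumptions chosen for illustration.

```python
# Hypothetical framing sketch; element names and signing scheme are assumed,
# not taken from the SSSRMAP specification.
import hashlib
import hmac
import xml.etree.ElementTree as ET


def _digest(body_el, key):
    # Naive canonical form: the serialized Body element as produced by ElementTree.
    return hmac.new(key, ET.tostring(body_el), hashlib.sha256).hexdigest()


def wrap(body_xml, key):
    """Wrap a request in Envelope/Signature/Body and sign the body."""
    envelope = ET.Element("Envelope")
    signature = ET.SubElement(envelope, "Signature")
    body_el = ET.SubElement(envelope, "Body")
    body_el.append(ET.fromstring(body_xml))
    signature.text = _digest(body_el, key)
    return ET.tostring(envelope, encoding="unicode")


def verify(envelope_xml, key):
    """Return True if the signature matches the body for this key."""
    envelope = ET.fromstring(envelope_xml)
    signature = envelope.find("Signature")
    body_el = envelope.find("Body")
    if signature is None or body_el is None or signature.text is None:
        return False
    return hmac.compare_digest(signature.text, _digest(body_el, key))


message = wrap('<Request action="Query"/>', b"shared-key")
accepted = verify(message, b"shared-key")
```

Putting this logic in one shared library, as discussed for SSSlib, would mean each component signs and checks messages the same way instead of reimplementing the framing.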
Meeting notes Day 2
Al Geist – action items
1. Need working group leaders to send me a couple of pages for the Qtr rpt: status and progress from Feb-June.
2. Any comments on points in the external reviewers report. A paragraph or two is fine.
Meeting notes
Narayan Desai – Restriction syntax proposal
• Goes over basic command syntax where an attribute can be “*” wildcarded
• Goes over complex command syntax
• Matching semantics – especially for wildcards
• Benefits of this approach – compact, powerful, simple syntax, validatable, data ownership is explicit
• Uses MySQL on the backend
• This syntax has Constructive Normal Form
• Discussion that we need to add negation before this is true
• What about regular expression support? – More discussion on how to do various things like “join” and “union”
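As a rough illustration of wildcard matching in a restriction, the sketch below models a restriction as a mapping from attribute name to either a required value or "*". This representation and the function names `matches` and `select` are assumptions for illustration only; the actual restriction syntax is expressed in XML and its matching semantics are defined in the working group's draft spec, not here.

```python
# Hypothetical model of restriction matching; not the SSS restriction syntax.
WILDCARD = "*"


def matches(restriction, record):
    """True if the record has every restricted attribute with a matching value.

    A "*" value matches any value, but the attribute must still be present.
    """
    for attribute, wanted in restriction.items():
        if attribute not in record:
            return False
        if wanted != WILDCARD and record[attribute] != wanted:
            return False
    return True


def select(restriction, records):
    """Return the records that satisfy the restriction, in input order."""
    return [record for record in records if matches(restriction, record)]


nodes = [
    {"name": "n1", "state": "up", "image": "rh73"},
    {"name": "n2", "state": "down", "image": "rh73"},
    {"name": "n3", "state": "up"},
]
up_with_any_image = select({"state": "up", "image": WILDCARD}, nodes)
```

In this model every attribute in a restriction must match, so the restriction behaves as a conjunction; expressing negation, "join", or "union" would need extra constructs, which is what the discussion above was about.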
Discussion of the Communication Infrastructure Spec Draft (hardcopy handed out)
• We should be able to hardwire components together.
• Existence of static file to define where things are – may just have service directory
• Unix domain socket protocol for SMP servers
• Vote – accept the spec pending: Yes 15, No 0, Abstaining 0
Meeting notes
Paul – Discusses the idea of hiding the socket code in a library
• Matt says he would be happy to contribute such a server.
Discussion of scalability of the event manager – not a problem because the number of meatballs does not increase with system size.
• Question about the ordering of event notifications
Scott – Lively discussion of the two XML variants
• What are the strengths and weaknesses of both
Agreement on having common error objects with 3-digit codes and messages
• Message is a human-readable string. Two special codes: 000 success, 999 unknown
• Straw vote: Yes 15, No 1, Abs 0
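The agreed error object can be sketched as follows. Only the 3-digit code, the human-readable message, and the two special codes (000 success, 999 unknown) come from the notes; the class name `ErrorObject`, the `ok` property, and the helper `unknown_error` are hypothetical names used for illustration.

```python
# Hypothetical sketch of a common error object; names are illustrative only.
import re
from dataclasses import dataclass

SUCCESS = "000"  # special code from the meeting notes
UNKNOWN = "999"  # special code from the meeting notes


@dataclass(frozen=True)
class ErrorObject:
    code: str     # exactly three ASCII digits, kept as a string so "000" survives
    message: str  # human-readable string

    def __post_init__(self):
        if re.fullmatch(r"[0-9]{3}", self.code) is None:
            raise ValueError(f"error code must be three digits, got {self.code!r}")

    @property
    def ok(self):
        return self.code == SUCCESS


def unknown_error(message="unknown error"):
    """Build the catch-all error for failures without a specific code."""
    return ErrorObject(UNKNOWN, message)


result = ErrorObject(SUCCESS, "job submitted")
failure = unknown_error("component did not respond")
```

Keeping the code as a string rather than an integer preserves the leading zeros of values such as 000.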
Add “supported scheme version” to Service directory
• Vote: Yes 15, No 0, Abs 0
Next meeting September 9-10 in DC so Fred can attend?