NGOP Status and Plans Jim Fromm Marc Mengel Jack Schmidt May 2, 2006.


NGOP Status and Plans

Jim Fromm, Marc Mengel, Jack Schmidt

May 2, 2006

Today’s talk…

• Current Status
  • Farms/CMS/General Server split
• Recent Enhancements
  • Performance Tuning
  • Configuration File cleanup
  • CMS Enhancements
• Future Enhancements

Current Status: Farms/CMS/General Server Split

• Goals:
  • Relieve bottlenecks by splitting out the servers
  • Reduce configuration upgrade times
  • Provide groups with independence
  • Simplify the General server by consolidating the two machines into one

Current Status: Farms/CMS/General Server Split

• Bottlenecks
  • Farms and CMS server hangs have been non-existent since the split.
  • The General server has experienced occasional hangs, but to a lesser degree (still two systems).
• This goal has been successfully met.

Current Status: Farms/CMS/General Server Split

• Reduction of configuration upgrade times
  • Prior to the split, a system configuration upgrade took 2+ hours when things went well.
  • Farms/CMS
    • A configuration upgrade takes less than 20 minutes.
    • Fewer monitored elements per server
    • One status engine allowed for the removal of Warshall’s algorithm for finding the transitive closure of a graph.
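
For readers unfamiliar with the algorithm named above, here is a minimal textbook sketch of Warshall's transitive closure. The slides do not show how NGOP used it, so the boolean adjacency-matrix representation and the three-element example are assumptions made for illustration. The triple loop means the cost grows with the cube of the number of nodes, which suggests why dropping this step would help on servers with many monitored elements.

```python
def transitive_closure(adjacency):
    """Return the transitive closure of a directed graph (Warshall's algorithm).

    adjacency is a square list of lists of booleans; adjacency[i][j] is True
    when there is an edge from node i to node j.
    """
    n = len(adjacency)
    reach = [list(row) for row in adjacency]  # copy so the input is untouched
    for k in range(n):              # allow node k as an intermediate hop
        for i in range(n):
            if reach[i][k]:         # only rows that can reach k may gain paths
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


# Hypothetical dependency chain: element 0 -> element 1 -> element 2
edges = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closure = transitive_closure(edges)
```

Here `closure[0][2]` is true because element 0 reaches element 2 through element 1, while the input matrix `edges` is left unchanged.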

Current Status: Farms/CMS/General Server Split

• General Server
  • Configuration upgrade time reduced to less than 30 minutes
  • Recent parser optimizations will likely cut configuration upgrade times to ¼ of the current time.
• This goal has been successfully met.

Current Status: Farms/CMS/General Server Split

• Server Independence
  • Both CMS and Farms are up to speed with doing their own configurations.
  • Upgrades are performed only when they need them.
  • CMS (Gary Stiehr) has taken the initiative to add several features.
  • Both groups have taken advantage of the splitting of the cluster.
• This goal has been successfully met.

Current Status: Farms/CMS/General Server Split

• General Server Consolidation
  • Not complete: still using two servers.
  • Doesn’t have the same urgency as the other items, and has been easy to put on the back burner.
  • Need to make this a priority.

Recent Enhancements

• Performance Tuning
  • Preprocessor speedup
    • Marc Mengel implemented a change that improved the performance of the XML preprocessor.
    • The NGOP preprocessor expands If_xxx/For_xxx tags.
    • It was using 90% CPU on startup.
    • This was a known Python performance issue.
    • Stunning improvements in configuration upgrade times!
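
To give a feel for what expanding a loop tag in an XML configuration involves, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The real NGOP tag syntax is not shown in these slides, so the `For_each` tag, its `var` and `in` attributes, the `$name` substitution form, and the `Host` element are all invented for illustration. This is not the NGOP preprocessor, nor the change Marc Mengel made.

```python
import copy
import xml.etree.ElementTree as ET


def expand_for(parent):
    """Replace each For_each child with one copy of its body per listed value.

    Hypothetical syntax: <For_each var="host" in="a b"> ...body... </For_each>,
    where "$host" inside attribute values of the body is substituted.
    """
    new_children = []
    for child in list(parent):
        if child.tag == "For_each":
            var = child.get("var")
            for value in child.get("in").split():
                for template in child:
                    clone = copy.deepcopy(template)
                    for node in clone.iter():
                        for key, text in list(node.attrib.items()):
                            node.set(key, text.replace("$" + var, value))
                    expand_for(clone)      # handle loops nested in the body
                    new_children.append(clone)
        else:
            expand_for(child)
            new_children.append(child)
    parent[:] = new_children               # swap in the expanded children


doc = ET.fromstring(
    '<Config><For_each var="host" in="node1 node2">'
    '<Host name="$host"/></For_each></Config>'
)
expand_for(doc)
names = [h.get("name") for h in doc.findall("Host")]
```

After the call, the single loop element has been replaced by two `Host` elements, one per value in the list.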

Recent Enhancements

• Configuration File Cleanup
  • New "grand unified" XML Document Type Definition: http://www.fnal.gov/docs/products/ngop/ngop_unified.dtd
  • XML editor friendly
    • Works well with the Merlin XML editor.

Merlin Screenshot

Recent Enhancements

• CMS
  • No downtimes: modified to allow multiple status engine roles to be defined for one set of definitions. This allows reconfiguration of one engine while the other remains active, eliminating downtimes due to configuration upgrades.
  • Used the SE API to create a GUI that only shows “bad” things.
  • Developed a generic plug-in agent that provides a standard way of defining agents in the CMS system.
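
The no-downtime approach described above can be pictured as an active/standby swap. The sketch below is only a toy model under assumptions: the `StatusEngine` and `EnginePair` classes, the role names, and the version strings are hypothetical and do not reflect the actual NGOP or CMS code.

```python
class StatusEngine:
    """Stand-in for a status engine; it only tracks which config is loaded."""

    def __init__(self, role):
        self.role = role
        self.config_version = None

    def load(self, config_version):
        # In a real system this would be the slow reconfiguration step.
        self.config_version = config_version


class EnginePair:
    """Two engine roles sharing one set of definitions."""

    def __init__(self, config_version):
        self.active = StatusEngine("primary")
        self.standby = StatusEngine("secondary")
        self.active.load(config_version)
        self.standby.load(config_version)

    def upgrade(self, new_version):
        """Reconfigure the idle engine, then swap so monitoring never stops."""
        self.standby.load(new_version)  # the active engine keeps serving
        self.active, self.standby = self.standby, self.active


pair = EnginePair("v1")
role_before = pair.active.role
pair.upgrade("v2")
role_after = pair.active.role
```

After `upgrade`, the engine that was idle is serving the new configuration and the previous one remains available with the old configuration.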

Future Enhancements

• Dynamic Configuration Upgrades
  • By far the most difficult enhancement to implement.
  • CMS needs have been addressed with the multiple status engine solution.
  • With the reduction of configuration upgrade times coupled with the CMS workaround, this requirement becomes a very low priority.

Future Enhancements (Cont)

• CMS-specific requested enhancements:
  • Marking Monitored Elements down across clusters
  • Accelerate alarms based on time (e.g. yellow becomes red after 8 hours)
  • Verify scalability to CMS planned growth
  • Documentation upgrade
• General
  • Improvement of logging subsystem
  • Research UDP protocol issues
    • Dropped packet issue seems under control with recent network tunings
    • May need to do this anyway to address CMS requirements for scalability
  • Web/Swatch agents need DELAY/GAP parameters
  • “Anti” rules for Swatch agent
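
The time-based alarm acceleration requested by CMS could look roughly like the sketch below. The 8-hour threshold and the yellow/red levels come from the example on this slide. The function name, its arguments, and the use of plain strings for severities are assumptions made for illustration, not an NGOP interface.

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(hours=8)  # threshold from the slide's example


def effective_severity(severity, raised_at, now):
    """Return "red" for a yellow alarm that has been open for 8 hours or more."""
    if severity == "yellow" and now - raised_at >= ESCALATE_AFTER:
        return "red"
    return severity


raised = datetime(2006, 5, 2, 8, 0)
fresh = effective_severity("yellow", raised, datetime(2006, 5, 2, 12, 0))
stale = effective_severity("yellow", raised, datetime(2006, 5, 2, 16, 0))
```

In this example the alarm is still yellow after 4 hours (`fresh`) and is reported as red once 8 hours have passed (`stale`).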

Future Enhancements (Cont)

• Wish List
  • Real dynamic configuration
  • SNMP agent
  • Email watcher

Summary

• Split of Farms and CMS has been successful:
  • Quicker reconfigs result in less downtime.
  • Splitting the load has reduced NGOP hangs.
  • CMS and Farms groups are managing things on their own timetable.
  • Need to consolidate the General server onto one machine.
• New release is needed:
  • New CMS requests
  • Investigate potential scalability issues
  • Improved logging
  • New and improved agents
  • Revamp documentation and website
  • Develop maintainable metrics

Information

• Main Site: http://www-isd.fnal.gov/ngop/ngop.html
• Documentation:
  • Users Guide: http://www-isd.fnal.gov/ngop/current/ngop_ug.htm
  • Admin Guide: http://www-sd.fnal.gov/ngop/current/ngop_admin_guide.htm