Post on 18-Jan-2016
EGEE-III INFSO-RI-222667
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Improving ENOC’s support for CODs
COD-18, Abingdon, UK
Guillaume Cessieux (CNRS, IN2P3-CC / EGEE SA2)
2008-12-03
GCX
Outline
• ENOC and COD interactions
• Status of work around network trouble tickets
• DownCollector
– Assessment
– Review of last 12 months
• Proposal to handle DownCollector’s troubles
– Processes
– Tools’ improvements
COD18 2008-12-03 2
EGEE Network Operating Centre
• ENOC
– Aiming to provide support for:
• Sites
• ROCs
• CODs
– Hard to get feedback and requirements from SA1
“Two different worlds”...
– Now real-life background with better vision
• ~ 0.5 FTE in EGI, main changes MUST happen before
– Drop unnecessary things, focus on useful ones
– Network support has a wider role than the ENOC in EGI
Current status with COD
• Only DownCollector now seems to be used by CODs [ https://ccenoc.in2p3.fr/DownCollector/ ]
– Very efficient integration in COD’s dashboard
• SA2 wants to know how to better serve CODs for network support
– Regarding processes
Balance between “wait and see” and over-engineered things
– Regarding tools and integration
DownCollector, other tools, CIC dashboard, alarms …
• Use this background to sketch wise, realistic and useful processes and tools
Around network trouble tickets (1/2)
• Currently TTdrawlight [ https://ccenoc.in2p3.fr/TTdrawlight/ ]
– Repository of network trouble tickets
– Not accurate enough and hard to use efficiently
• Network trouble tickets are not a panacea
– « Главным образом сеть вниз. Будет вверх скоро » (~ « Main router is down. Will be up soon. »)
Targeted at a local community
– But often the only operational information available…
• Strong privacy issues when sharing network trouble tickets
– No filtering of the sensitive information delivered (school, military…)
– Fear of comparison and competition
– Knowledge database of network trouble tickets compromised?
Around network trouble tickets (2/2)
• 19 NRENs currently sending their tickets to the ENOC
– EGEE relies on networks from ~ 50 NRENs + GÉANT2
We cover ~80% of European Grid sites
– 2800 e-mails for 900 tickets/month
– Really hard to deal with the meaning of tickets (location, duration...)
• Standardisation of network TT?
– Can enable painless, accurate and automatic management of TT
– Strong advances in this domain, but hard to promote to NRENs
• Situation to be sorted out between NRENs & SA2
– Solve centralisation, accuracy and exposure of TT
– Then tools will easily follow
Around network monitoring
• Connectivity addressed with DownCollector
– But not performance
• Hard to get information on end-to-end performance
– Requires going into the details of network paths and devices
300 certified sites, 50 NRENs... Inhomogeneous domains
– Network is shared, should be monitored once and not at project level
Slowly converging toward perfSONAR (not yet mature)
• EGEE network troubleshooting tool upcoming
– Lightweight package from SA2
– Prototype around January 2009
DownCollector (1/3)
• Now a key tool reporting whether Grid nodes are listening on TCP
• 2-minute accuracy
• ~ 2600 nodes polled
– Often first to detect some failures
• GOCDB scheduled downtimes are managed
– Troubles not reported for sites in scheduled downtime
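The polling idea on this slide can be sketched in a few lines of Python. This is only an illustration, not DownCollector's actual code: the function names (`is_listening`, `in_scheduled_downtime`, `should_report`), the timeout value and the way downtime windows are passed in are all assumptions made for the example.

```python
import socket
from datetime import datetime


def is_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


def in_scheduled_downtime(now: datetime, downtimes) -> bool:
    """downtimes: iterable of (start, end) datetime pairs (hypothetical format)."""
    return any(start <= now <= end for start, end in downtimes)


def should_report(node_up: bool, now: datetime, downtimes) -> bool:
    """Report only unreachable nodes that are not in a scheduled downtime."""
    return (not node_up) and not in_scheduled_downtime(now, downtimes)
```

A poller would call `is_listening` for each node every two minutes and pass the result to `should_report` together with the downtime windows it obtained for the node's site.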
DownCollector (2/3)
• A trouble = all Grid hosts of a site unreachable
– To avoid measuring host availability
• Network checkpoint = border router
– Demarcation point for ENOC’s responsibility
– Checked during trouble
• Three kinds of troubles
1. OFF-SITE: Network checkpoint NOT reached
Fault in: WAN, MAN, NREN, GÉANT2, ISP...
2. ON-SITE: Network checkpoint reached
Fault in: LAN, power, software ...
3. UNKNOWN: No clear and reliable checkpoint, but site in trouble
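The three kinds of troubles amount to a small decision rule, sketched here in Python. The function name `classify_trouble` and the convention of passing `None` when a site has no usable checkpoint are assumptions for the example, not part of DownCollector.

```python
from typing import Optional, Sequence


def classify_trouble(hosts_reached: Sequence[bool],
                     checkpoint_reached: Optional[bool]) -> Optional[str]:
    """Classify the state of one site.

    hosts_reached: one flag per Grid host of the site.
    checkpoint_reached: True/False for the border router, or None when the
    site has no clear and reliable checkpoint.
    Returns None when at least one host is reachable (no trouble).
    """
    if any(hosts_reached):
        return None  # a trouble requires ALL hosts of the site to be unreached
    if checkpoint_reached is None:
        return "UNKNOWN"
    return "ON-SITE" if checkpoint_reached else "OFF-SITE"
```

Requiring every host to be unreached before the checkpoint is even consulted is what keeps the rule from measuring individual host availability.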
[Diagram: NREN X and GÉANT2 around the site’s checkpoint, which separates the OFF-SITE and ON-SITE zones]
DownCollector (3/3)
• Is it trustworthy or biased?
– Is a failure seen from the ENOC a failure for the entire infrastructure?
For ON-SITE troubles: ~YES
– What about French sites reached without using GÉANT2? Remote probes?
– 2 instances of DownCollector? ~NO
[Diagram labels: ENOC, RENATER, GÉANT2, NREN X, French site, Foreign site 1, Foreign site 2, Router A, Router B, Checkpoint for site 1]
Troubles detected by DownCollector
• 54% of detected problems are ON-SITE
• Scope
– 300 certified sites
– Last 12 months
Number of troubles per month:
            Min   Max   Avg   % Avg
OFF-SITE    157   354   248   28%
ON-SITE     273   615   467   54%
UNKNOWN      59   167   106   12%
[Chart: number of troubles]
Troubles are not concentrated on a few sites!
Troubles’ durations
Last 12 months troubles’ dispatching:
Noticed resolution time    OFF-SITE   ON-SITE
<= 5 min                   36.08%     42.71%
> 5 min and <= 30 min      45.04%     36.36%
> 30 min and <= 1h          6.48%      6.37%
> 1h and <= 4h              8.17%      7.67%
> 4h and <= 12h             3.14%      3.32%
> 12h and <= 1d             0.79%      1.93%
> 1d                        0.31%      1.65%
• 80% solved within 30 min (Pareto’s law)
• The others
– OFF-SITE: avg 45 troubles/month
– ON-SITE: avg 85 troubles/month
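The "80% within 30 min" figure can be checked by summing the first two rows of the resolution-time table. A small Python sketch, with the percentages copied from the slide (the dictionary layout and the function name `solved_within_30_min` are just for illustration):

```python
# Share of troubles per resolution-time bucket, in percent, from the slide.
# Buckets are ordered from shortest to longest duration.
SHARES = {
    "OFF-SITE": [36.08, 45.04, 6.48, 8.17, 3.14, 0.79, 0.31],
    "ON-SITE":  [42.71, 36.36, 6.37, 7.67, 3.32, 1.93, 1.65],
}


def solved_within_30_min(kind: str) -> float:
    """Sum of the '<= 5 min' and '> 5 min and <= 30 min' buckets."""
    return sum(SHARES[kind][:2])


for kind in SHARES:
    print(f"{kind}: {solved_within_30_min(kind):.2f}% solved within 30 min")
# OFF-SITE: 81.12% solved within 30 min
# ON-SITE: 79.07% solved within 30 min
```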
Yearly sum of downtimes per sites
• N.B.: unscheduled downtime
• Best: 4 minutes down
• Worst: 64 days (PPS…)
85% of sites < 4d of downtime/year = 98.90% reachability/year
46 sites
Last 12 months total downtime for site 46: 4d OFF-SITE, 17d ON-SITE
164 sites have less than 1d of downtime during the last 12 months
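The 98.90% figure follows directly from 4 days of downtime over one year. A minimal check in Python, assuming a 365-day year (the function name `reachability_pct` is made up for the example):

```python
def reachability_pct(downtime_days: float, period_days: float = 365.0) -> float:
    """Percentage of the period during which the site was reachable."""
    return 100.0 * (1.0 - downtime_days / period_days)


print(f"{reachability_pct(4):.2f}%")  # 98.90%
```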
First assessment
• Networks are quite reliable
– Few long outages on resilient transit networks
– ON-SITE troubles are significant
– 30 minutes seems a wise threshold
– DownCollector seems reliable and trustworthy enough
• Automatic management of network TT currently not reliable
• Currently few interactions between SA2 and CODs
• This was discussed with pole1 for improvements
– Thanks to them for their feedback; the results follow
Proposal for troubles handling
• Map troubles handling around the three kinds of problem from DownCollector
[Diagram: handling flow for ON-SITE, OFF-SITE and UNKNOWN troubles, with the following labels]
– Create alarms in COD dashboard from DownCollector
– Alarm hierarchy and masking
– GGUS tickets created by CODs to sites after 30 minutes
– Not ENOC’s responsibility
– Currently not managed
– ENOC’s responsibility
– Allow flagging particular outages to focus on them (cf. next slide)
– Threshold 30 minutes?
– Only inform
– Try to reduce the number of unknown troubles
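One possible reading of this proposal, written as a dispatch function. This is a hypothetical sketch: the slide does not state exactly which label applies to which kind of trouble, so the mapping below (tickets to sites for ON-SITE, "only inform" for OFF-SITE), the returned strings and the name `proposed_handling` are assumptions. The 30-minute value is the threshold the slide itself marks as an open question.

```python
from datetime import timedelta

# Threshold suggested on the slide (still an open question there).
TICKET_THRESHOLD = timedelta(minutes=30)


def proposed_handling(kind: str, duration: timedelta) -> str:
    """Map a DownCollector trouble kind to a handling step (assumed mapping)."""
    if kind == "ON-SITE":
        # Assumed: not ENOC's responsibility, so CODs ticket the site
        # once the outage has lasted long enough.
        if duration >= TICKET_THRESHOLD:
            return "COD creates GGUS ticket to site"
        return "alarm in COD dashboard"
    if kind == "OFF-SITE":
        # Assumed: ENOC's responsibility, CODs are only informed.
        return "inform COD; ENOC follows up"
    if kind == "UNKNOWN":
        return "alarm in COD dashboard; improve site checkpoint"
    raise ValueError(f"unexpected trouble kind: {kind!r}")
```

Keeping the threshold in a single constant makes it easy to try 15 minutes instead of 30 without touching the rest of the logic.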
OFF-SITE troubles handling
• ENOC’s responsibility: devolving trouble resolution to NRENs/GÉANT2
• Targeted key information: expected end date
– Hard to get…
• Enable marking of particular outages
– Maybe then automatically create a ticket in ENOC’s helpdesk (GGUS) to exchange information with COD
[Illustration: outage flagged “ENOC please follow that”, with tickets #GGUS-41, #GGUS-42, #GGUS-43]
Proposal for tools (1/2)
• ENOC to work with sites to improve some network checkpoints
– Reduce number of unknown troubles (~ 12%, ~106/month)
– 351 sites in database: 32 (9%) without usable checkpoint
[ https://ccenoc.in2p3.fr/DownCollector/?v=list_headnodes ]
• ENOC’s bar in COD dashboard
[Mock-up of ENOC’s bar; legend: Trouble OFF-SITE, ON-SITE, UNKNOWN; time axis label: Now-5h]
Proposal for tools (2/2)
• Notification from DownCollector to site admins for long-standing outages (15 or 30 minutes?)
– Integration into Nagios not sufficient?
– Existing DownCollector feature: Subscribe to troubles
[ https://ccenoc.in2p3.fr/DownCollector/?v=subscription ]
Released with EGEE broadcast on 2008-07-16
34 sites, 26 distinct e-mails currently registered
Noticed problem: e-mails not reaching disconnected sites…
No threshold implemented yet
[Screenshot annotation: 1.5 - select threshold]
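The threshold that the subscription feature is missing could look like the following sketch. It is not the DownCollector implementation: the function name `should_notify`, its parameters and the "notify once per outage" behaviour are assumptions, and the default of 30 minutes is one of the two values the slide is asking about.

```python
from datetime import datetime, timedelta


def should_notify(outage_start: datetime,
                  now: datetime,
                  threshold: timedelta = timedelta(minutes=30),
                  already_notified: bool = False) -> bool:
    """Notify subscribers once, when an outage has lasted at least `threshold`."""
    return (not already_notified) and (now - outage_start) >= threshold
```

A subscriber-selected threshold, as the "select threshold" annotation suggests, would simply be passed in as `threshold` per subscription.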
Actions list for tools
• ENOC
1. DownCollector
– Improve checkpoints
– Add threshold to subscribe feature?
2. Allow flagging important network outages and study a scheme to exchange information about them (GGUS, ENOC’s helpdesk...)
3. Provide ENOC’s bar
• CIC portal
1. Manage network alarms & alarm masking
2. Integrate ENOC’s bar
Conclusion
• It’s really going ahead
• Some implementation details to sort out
– Scalability, regionalisation
– Right now, or wait for your next model (alarm DB, R-COD etc.)?
– CIC portal & ENOC priorities, manpower and roadmap
• Other ideas, feedback etc. always welcome
– Help design the network support you need
Questions?