Gemini OSU - UKLC Update
Annie Griffith, December 2007
Discussion Items
Focus of discussion will be on the system elements of the recent Gemini incident:
Summary of findings to date
Review of learning points
Lessons learned applicable to UKLTR
Open discussion
Overview of events
21st Oct – upgraded system implemented – API errors identified
22nd Oct – issue with shippers able to view other shippers’ data; shipper access revoked
24th Oct – code fix for data view implemented; internal National Grid access only
26th Oct – external on-line service restored
1st Nov – hardware changes implemented to external service
2nd Nov – API service restored
Further intermittent outages occurring on the API service
5th Nov – last outage on the API service recorded at 13:00
Root cause analysis still underway
Summary - Causes
Two problems identified
1. Application code construct – associated with high-volume, instantaneous, concurrent usage of the same transaction type. Fix deployed 05:00 on 23/10/07
2. API error – associated with saturation usage, manifesting as “memory leakage” that builds up over time and eventually results in loss of service. Indications are that this is an error in a 3rd-party system software product. Investigations continuing
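Purely as an illustration of the second failure pattern (the actual defect is believed to sit in a third-party product, and none of the names below come from the slides), the sketch shows how saturation usage can present as “memory leakage”: a handler that retains a reference to every request it serves grows steadily until the service is lost.

```python
# Hypothetical sketch only: a handler that caches every response it has ever
# built. Under sustained ("saturation") load the cache grows without bound,
# which from the outside looks like a slow memory leak ending in loss of service.
import tracemalloc

class ApiHandler:
    def __init__(self):
        self._response_cache = {}          # never evicted: unbounded growth

    def handle(self, request_id, payload):
        result = {"id": request_id, "echo": payload}
        self._response_cache[request_id] = result   # the "leak"
        return result

if __name__ == "__main__":
    tracemalloc.start()
    handler = ApiHandler()
    for i in range(200_000):               # simulate prolonged high-volume usage
        handler.handle(i, "x" * 100)
    current, peak = tracemalloc.get_traced_memory()
    print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```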
Fixes since Go-live
Since 4th November: 10 application defects, all minor, all fixed
No outstanding application errors
Gemini OSU Testing
Extensive testing programme:
2 months integration and system testing
6 weeks OAT
Performance testing – volume testing at 130% of current user load
8 weeks UAT
4 weeks shipper trials (voluntary) – 3 participants
7 weeks dress rehearsal – focus on actions needed to complete the technical hardware upgrade across multiple servers and platforms
Testing Lessons Learnt
UAT: each functional area tested discretely
Issues around concurrent usage unknown and therefore not specifically targeted for testing
“Field” testing of the system under fully loaded conditions may have highlighted this problem, but this is not certain
OAT: although volume and stress testing completed successfully, reliability testing/soak testing over a prolonged period was not undertaken
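For context, the reliability/soak testing mentioned above typically means driving steady load for many hours and watching for gradual resource growth. The sketch below is a generic harness under assumed parameters (the endpoint stub, duration, and growth threshold are invented for illustration), not the Gemini OAT suite.

```python
# Hypothetical soak-test harness: sustain a steady request rate for a long
# period and flag memory that trends upward, the signature of a leak that
# only appears after prolonged running.
import time
import tracemalloc

def call_api(i):
    # Stand-in for a real call to the system under test.
    return {"id": i}

def soak_test(duration_s=8 * 3600, check_every=10_000, max_growth_mb=50):
    tracemalloc.start()
    baseline_mb = None
    deadline = time.time() + duration_s
    i = 0
    while time.time() < deadline:
        call_api(i)
        i += 1
        if i % check_every == 0:
            current_mb = tracemalloc.get_traced_memory()[0] / 1e6
            if baseline_mb is None:
                baseline_mb = current_mb
            print(f"requests={i} memory={current_mb:.1f} MB")
            if current_mb - baseline_mb > max_growth_mb:
                raise RuntimeError("memory growing steadily: possible leak")

if __name__ == "__main__":
    soak_test(duration_s=60)   # short run for demonstration only
```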
Other Observations
Communications during incident
Undersold the scale of the change
Engagement – right individuals/forums?
Planning for failure… as well as success
UKLTR – What’s different?
Main workhorse of the system is batch processing
Predictable transaction volumes
Far easier to replicate load and volume testing
Easy to verify outputs
Shipper interaction is batch driven
Low volume of on-line users
Doesn’t have same level of real-time/instantaneous transaction criticality
Ability to do more verification following cut-over before releasing data from the upgraded system to the outside world
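One way such a verification window can be used (a generic sketch with invented file and field names, not a description of the actual UKLTR cut-over checks) is to reconcile a batch run from the legacy system against the same run from the upgraded system before any data is released externally.

```python
# Hypothetical reconciliation sketch: compare the batch output of the legacy
# system with the same run from the upgraded system before release.
import csv

def load_batch(path, key_field="record_id"):
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

def reconcile(old_path, new_path):
    old, new = load_batch(old_path), load_batch(new_path)
    return {
        "missing": sorted(old.keys() - new.keys()),   # in old run only
        "extra": sorted(new.keys() - old.keys()),     # in new run only
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

# Example usage (paths and key field are placeholders):
# report = reconcile("legacy_run.csv", "upgraded_run.csv")
# Release data only if all three lists in `report` are empty.
```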
UKLTR – Lessons to be applied
Plan for failure
Differing levels: problems vs. incidents
Technical and resource planning
Fully prepared Incident Management procedure established in advance and understood by all parties
Escalation routes
Communications mechanisms
Status communications to be issued during the outage period
Milestone updates? Who to?
Fall-back options
Old system provides a straightforward option
However, once interface data has been propagated to other systems, we will be in a “fix-forward” situation
Discussion
?