Gemini OSU - UKLC Update
Annie Griffith, December 2007
Discussion Items
Focus of discussion will be on the system elements of the recent Gemini incident:
Summary of findings to date
Review of learning points
Lessons learned applicable to UKLTR
Open discussion
Overview of events
21st Oct – upgraded system implemented – API errors identified
22nd Oct – issue with shippers able to view other shippers’ data; shipper access revoked
24th Oct – code fix for data view implemented; internal National Grid access only
26th Oct – external on-line service restored
1st Nov – hardware changes implemented to external service
2nd Nov – API service restored
Further intermittent outages occurring on the API service
5th Nov – last outage on the API service recorded at 13:00
Root cause analysis still underway
Summary - Causes
Two problems identified
1. Application code construct – associated with high-volume, instantaneous, concurrent usage of the same transaction type. Fix deployed 05:00 on 23/10/07
2. API error – associated with saturation usage, manifesting as “memory leakage” that builds up over time and eventually results in loss of service. Indications are that this is an error in a 3rd-party system software product. Investigations continuing
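Purely as an illustration of the second failure pattern (the actual defect is believed to sit in a third-party product, and none of the names below come from the slides), the sketch shows how saturation usage can present as “memory leakage”: a handler that retains a reference to every request it serves grows steadily until the service is lost.

```python
# Hypothetical sketch only: a handler that caches every response it has ever
# built. Under sustained ("saturation") load the cache grows without bound,
# which from the outside looks like a slow memory leak ending in loss of service.
import tracemalloc

class ApiHandler:
    def __init__(self):
        self._response_cache = {}          # never evicted: unbounded growth

    def handle(self, request_id, payload):
        result = {"id": request_id, "echo": payload}
        self._response_cache[request_id] = result   # the "leak"
        return result

if __name__ == "__main__":
    tracemalloc.start()
    handler = ApiHandler()
    for i in range(200_000):               # simulate prolonged high-volume usage
        handler.handle(i, "x" * 100)
    current, peak = tracemalloc.get_traced_memory()
    print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```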
Fixes since Go-live
Since 4th November: 10 application defects, all minor, all fixed
No outstanding application errors
Gemini OSU Testing
Extensive testing programme:
2 months integration and system testing
6 weeks OAT
Performance testing – volume testing at 130% of current user load
8 weeks UAT
4 weeks shipper trials (voluntary) – 3 participants
7 weeks dress rehearsal – focus on actions needed to complete the technical hardware upgrade across multiple servers and platforms
Testing Lessons Learnt
UAT: each functional area tested discretely
Issues around concurrent usage unknown and therefore not specifically targeted for testing
“Field” testing of the system under fully loaded conditions may have highlighted this problem, but this is not certain
OAT: although volume and stress testing completed successfully, reliability testing/soak testing over a prolonged period was not undertaken
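For context, the reliability/soak testing mentioned above typically means driving steady load for many hours and watching for gradual resource growth. The sketch below is a generic harness under assumed parameters (the endpoint stub, duration, and growth threshold are invented for illustration), not the Gemini OAT suite.

```python
# Hypothetical soak-test harness: sustain a steady request rate for a long
# period and flag memory that trends upward, the signature of a leak that
# only appears after prolonged running.
import time
import tracemalloc

def call_api(i):
    # Stand-in for a real call to the system under test.
    return {"id": i}

def soak_test(duration_s=8 * 3600, check_every=10_000, max_growth_mb=50):
    tracemalloc.start()
    baseline_mb = None
    deadline = time.time() + duration_s
    i = 0
    while time.time() < deadline:
        call_api(i)
        i += 1
        if i % check_every == 0:
            current_mb = tracemalloc.get_traced_memory()[0] / 1e6
            if baseline_mb is None:
                baseline_mb = current_mb
            print(f"requests={i} memory={current_mb:.1f} MB")
            if current_mb - baseline_mb > max_growth_mb:
                raise RuntimeError("memory growing steadily: possible leak")

if __name__ == "__main__":
    soak_test(duration_s=60)   # short run for demonstration only
```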
Other Observations
Communications during incident
Undersold the scale of the change
Engagement – right individuals/forums?
Planning for failure… as well as success
UKLTR – What’s different?
Main workhorse of the system is batch processing
Predictable transaction volumes
Far easier to replicate load and volume testing
Easy to verify outputs
Shipper interaction is batch driven
Low volume of on-line users
Doesn’t have same level of real-time/instantaneous transaction criticality
Ability to do more verification following cut-over before releasing data from the upgraded system to the outside world
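One way such a verification window can be used (a generic sketch with invented file and field names, not a description of the actual UKLTR cut-over checks) is to reconcile a batch run from the legacy system against the same run from the upgraded system before any data is released externally.

```python
# Hypothetical reconciliation sketch: compare the batch output of the legacy
# system with the same run from the upgraded system before release.
import csv

def load_batch(path, key_field="record_id"):
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

def reconcile(old_path, new_path):
    old, new = load_batch(old_path), load_batch(new_path)
    return {
        "missing": sorted(old.keys() - new.keys()),   # in old run only
        "extra": sorted(new.keys() - old.keys()),     # in new run only
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

# Example usage (paths and key field are placeholders):
# report = reconcile("legacy_run.csv", "upgraded_run.csv")
# Release data only if all three lists in `report` are empty.
```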
UKLTR – Lessons to be applied
Plan for failure
Differing levels: problems vs. incidents
Technical and resource planning
Fully prepared Incident Management procedure established in advance and understood by all parties
Escalation routes
Communications mechanisms
Status communications to be issued during the outage period
Milestone updates? Who to?
Fall-back options
Old system provides a straightforward option
However, once interface data has been propagated to other systems, we will be in a “fix-forward” situation
Discussion
?