Copyright© 2015 CSG Systems International, Inc. and/or its affiliates (“CSG International”). All rights reserved.
Strengthening Business Continuity Through Strategic Partnership
Cynthia L. Jenkins,
Lead BCM Analyst, CSG International
Who is CSG International?
First processed Cable TV statements in 1983 as part of FDR
Became independently owned in 1994
First publicly traded on NASDAQ in 1996 (CSGS)
Now the 2nd largest cable billing vendor in the world
• Producing 65M pieces of mail each month
• One of the top 10 US mailers
• Over 3,600 employees in 24 countries
• Corporate Headquarters in Denver, CO
• Largest office in Omaha, NE
What Do CSG Products Do?
Manage Cable, Internet, and Telephone services
Support 90,000 Customer Service Representatives world‐wide
Compose, print, and mail statements monthly
Support on‐line bill pay
Support scheduling and routing of work orders
Mediate 8 Trillion Call Detail records annually
Enable detailed data analysis and data mining
[Figure labels: Smart Phone, IVR/SMS, Call Center, Technician, Kiosk, Direct Mail & Statement, Web / E‐Mail]
Some of CSG’s 500 + Clients
[Figure: client logos grouped by region: APAC, EMEA, Americas]
Corporate BC/DR Strategy
One facility serves one purpose
• Data Centers
• Output Solution Centers
• Office Facilities
Each facility has a unique BC/DR plan
BC/DR plans are written/maintained by CSG resources familiar with the facility, hardware, product, or service
CSG BCM Department
• Organize large annual exercises
• Advise on and assist in determining BC/DR requirements for new products and services
• Track BC plan updates
• Provide liaisons for all the BC/DR partners/vendors
Background for Presentation
CSG North American Service Bureau
• Based in the Omaha Data Center
• Contractual commitments to clients to return to operation within three distinct Recovery Time Objectives (RTO)
• The Minimum Acceptable Recovery Configuration (MARC) level means systems are up and ready for client use at the end of the RTO
• CSG Solutions are classified by MARC level:
- MARC I RTO – 48 hours
- MARC II RTO – 3 to 7 days
- MARC III RTO – 8 to 31 days
• The recovery time starts when the disaster is declared by the CSG Emergency Management Team
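The relationship between the declaration time and the three RTOs can be sketched in a few lines of Python. This is only an illustration, not CSG tooling: it assumes the contractual deadline is the upper end of each MARC range, and the function name, the table name, and the declaration time are hypothetical.

```python
from datetime import datetime, timedelta

# Outer bound of each MARC level's RTO, taken from the slide.
# Assumption: the deadline is the upper end of each stated range.
MARC_RTO_LIMIT = {
    "MARC I": timedelta(hours=48),
    "MARC II": timedelta(days=7),
    "MARC III": timedelta(days=31),
}


def recovery_deadline(marc_level: str, declared_at: datetime) -> datetime:
    """Return the latest time a solution must be up and ready for client use.

    The clock starts when the disaster is declared, not when it occurs.
    """
    return declared_at + MARC_RTO_LIMIT[marc_level]


declared = datetime(2014, 8, 4, 9, 0)  # hypothetical declaration time
for level in MARC_RTO_LIMIT:
    print(level, recovery_deadline(level, declared).isoformat())
```

Keeping the limits in one table makes it easy to check a recovery timeline against each contractual commitment.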
CSG – Sungard AS Partnership
Launched in 2001 ‐ Open Systems recovery support only
Mainframe support added in 2010
For first 10 years, CSG was responsible for Open Systems recoveries
Joined Sungard Managed Recovery Program (MRP) in 2012
CSG now views Sungard engineers as extensions of CSG resources at time of disaster
Reasons for Adopting MRP
CSG Dual Recovery Strategy
• North America Service Bureau Recovery – Sungard
• Internal CSG Products and Corporate Support – Tempe, AZ
Growing number of OS images to recover at Sungard
• 6 different operating systems to recover
• Over 400 OS instances in all
Augment CSG Staff
• CSG Staff responsible for both locations
• Additional trained staff needed to make RTOs
CSG Production Environment
Tightly Integrated Product Set
• Most products dependent on Middleware and Mainframe to function
• Product dependencies change over time
Highly Configurable Environment – experiencing frequent change
• Averages 30 to 40 changes per day
Large data recovery
• Open Systems – 130 TB
• Mainframe – 54 TB
Production Environment
[Diagram labels: Mainframe; Billing Engine; Data Warehouse; Video Interfaces; APIs; Job Routing and Scheduling; EBPP Solution; APIs to External Transactions; Advanced Product Catalog; External Video Transactions; Interfaces to other CSG products; Statement Files For Composition; Interface to Mail Tracking; Statement Images; Usage Processing; External Usage Files; Provisionable Services]
In Addition….
CSG invested in permanent infrastructure in the Sungard data center, which includes:
• NIM servers
• Kickstart servers
• Jumpstart servers
• NAS Storage
• SAN Storage
• VMware hosts and vCenter host
• Network support equipment
• Symantec replication appliances (Open Systems replicated data)
• EMC DLm (Mainframe replicated data)
• Communication circuits between Sungard and Omaha Data Center
Getting Started ‐ Knowledge Transfer
Initial MRP Workshop (March, 2012)
• Tightly integrated products
• Standard build process for all OS’s
• Permanent infrastructure at Sungard
Spent over 2 days in discussions
• At the end, Sungard Recovery Solution Architects had everything to produce the first draft of recovery documentation
First Exercise – August, 2012
Huge Learning Experience for Both Sides
• CSG
- Recovery instructions needed to be more explicit
- Needed to “let go” so Sungard engineers could work
- Better collaboration tools needed
- Change management policy needed during the exercise
• Sungard
- More recovery document review needed
- Another Sun OS workshop needed
- Different exercise management structure was needed
- Agreement on better collaboration tools and change management
• Put action plans in place to adjust and correct each situation
• Tracked progress during weekly status meetings
Second Exercise – January, 2013
180 Degree Change!
Augmented staffing showed better results.
Sungard made changes in exercise management so engineers built servers and did not manage shifts.
CSG and Sungard jointly developed a Google Docs spreadsheet to convey build progress
Change management policy worked very well
The second Sun workshop paid great benefits
• Held in CSG Denver office with Sun and Windows/VMware SMEs
• End result was more detailed recovery documentation
• More emphasis on build verification checklist
- Last step Sungard performs
- First step CSG performs
Collaborative Google Document
Users update this document and others can see the updates real‐time. This enables CSG System Administrators to see system build progress without interfering with Sungard engineers’ work.
Exercise Change Management Policy
Sungard engineer identifies issue with recovery procedure
Shift lead is notified. Issue validated with Sungard SME before escalating.
Shift lead engages CSG SME on conference bridge. Sungard Test Manager and CSG Onsite Manager are also notified.
Issue is either clarified or a change is agreed on between Sungard recovery team and CSG SME.
Request for change is brought to the Sungard Test Manager and CSG Onsite Manager for Approval.
If approved, the change is documented in Sungard Observations & Recommendations and CSG Issue Tracking and noted as “single use” or “re‐usable”.
Sungard engineer is given approval to make change.
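The policy above is an ordered approval chain, which could be modeled as a small state tracker that refuses to skip steps. The sketch below is illustrative only: the step names are paraphrased from the slide, the class and field names are hypothetical, and the branch where an issue is simply clarified (no change needed) is left out.

```python
from dataclasses import dataclass, field

# Ordered steps of the exercise change policy, paraphrased from the slide.
STEPS = [
    "identified",  # Sungard engineer identifies an issue with a procedure
    "validated",   # shift lead validates it with a Sungard SME
    "escalated",   # CSG SME engaged on the bridge; both managers notified
    "agreed",      # change agreed between recovery team and CSG SME
    "approved",    # Sungard Test Manager and CSG Onsite Manager approve
    "documented",  # recorded in both trackers with its reuse label
    "released",    # engineer is given approval to make the change
]


@dataclass
class ChangeRequest:
    summary: str
    reuse: str = "single use"  # or "re-usable"
    history: list = field(default_factory=lambda: ["identified"])

    def advance(self, step: str) -> None:
        """Move to the next step; refuse to skip ahead or go past the end."""
        if len(self.history) == len(STEPS):
            raise ValueError("change request is already released")
        expected = STEPS[len(self.history)]
        if step != expected:
            raise ValueError(f"expected '{expected}', got '{step}'")
        self.history.append(step)

    @property
    def may_apply(self) -> bool:
        """True only once every step in the chain has been completed."""
        return self.history[-1] == "released"


cr = ChangeRequest("hypothetical: wrong mount point in a build step")
for step in STEPS[1:]:
    cr.advance(step)
print(cr.may_apply)
```

Recording the history per request also gives the Lessons Learned review a record of which changes were labeled re-usable.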
Third Exercise – August, 2013
Showed some significant challenges with the Sun builds
Joint decision to run a Proof of Concept exercise focusing on only Sun servers
The POC was held in January, 2014
• Selected a representative set of Sun servers
• Brought a CSG Sun SME on‐site to work with Sungard engineers
• Experience was absolutely invaluable
- Fostered an understanding of the Sungard working environment
- CSG SME could explain build steps first hand
- Sungard engineers found areas to improve in the documentation
- CSG SME found processes on the CSG side needing adjustment
August, 2014
Largest exercise ever attempted by CSG
• Close to 400 OS instances were built
• Over 200 OS instances were fully restored and integrated
• Finished 8 hours before the end of the exercise time
• Extremely successful exercise
Lessons Learned
• Entire recovery timeline needed adjustment
• More robust tools needed to manage exercise (or recovery)
• First time using CSG off‐shore employees ‐ special access procedures are needed for them
• Investigate permanent infrastructure upgrades at Sungard
• Large recovery made possible with the help of Sungard MRP resources
Continual Change Management
Change Management (outside of exercises)
• Omaha environment is very active – 30 to 40 changes daily
• As August exercise gets closer this becomes a very important topic
- Best practice – Changes after lockdown date
› Product already in production – leave it out
› New product not in production – include it
- Track Changes in Production Environment
- Keep the Sungard equipment reservations up to date!
CMDB Interface
• The size and complexity of the Omaha environment mandates interfacing the CSG and Sungard CMDBs
• CSG hired a college intern to develop queries that extract data from the CSG CMDB and format it for use by Sungard
- 5 operating systems with an average of 7 queries each
- There are also disparate CMDB tools
› Sungard uses HP
› CSG uses BladeLogic
• CSG has just automated this process, leaving files on an FTP server for Sungard to retrieve
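As a rough illustration of the extract-and-format step, the sketch below groups server records by operating system and writes one CSV file per OS, ready to be dropped on a transfer server. It does not reproduce the actual queries or file layout: the field names, file names, and sample records are all hypothetical, and the real extracts come from the CMDB tool rather than an in-memory list.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical column layout; the real format agreed with Sungard is not
# described in the slides.
FIELDS = ["hostname", "os", "cpu_count", "memory_gb"]


def export_by_os(records, out_dir):
    """Write one CSV per operating system and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["os"]].append(rec)
    paths = []
    for os_name, rows in sorted(grouped.items()):
        path = out_dir / f"{os_name.lower().replace(' ', '_')}.csv"
        with path.open("w", newline="") as fh:
            # Fields not in FIELDS are dropped rather than raising an error.
            writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths


sample = [  # made-up records standing in for CMDB query results
    {"hostname": "app01", "os": "Linux", "cpu_count": 8, "memory_gb": 32},
    {"hostname": "db01", "os": "Solaris", "cpu_count": 16, "memory_gb": 64},
    {"hostname": "app02", "os": "Linux", "cpu_count": 8, "memory_gb": 32},
]
with tempfile.TemporaryDirectory() as tmp:
    for p in export_by_os(sample, tmp):
        print(p.name, len(p.read_text().splitlines()) - 1, "servers")
```

One file per operating system mirrors the per-OS grouping of the queries and keeps each extract small enough to review by hand.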
Best Lessons Learned from MRP Experience
Frequent reviews of the recovery documents ‐ not a one and done!
Foster a close working relationship with your Service Delivery Manager (SDM) and other members of your Sungard Account Team
• Meet weekly to plan 2 disparate annual exercises
• Track changes in environments (Omaha, Atlanta, and Sungard Infrastructure)
Bringing CSG SME on‐site for Proof of Concept was incredibly valuable
Use the smaller April exercise to test items we want to improve on for August
Tracking change management during exercise is critical
• Develop a policy and make sure all engineers (CSG and Sungard) understand it
Track all issues before and during the exercise
• Include things that worked well to ensure they are not forgotten
Solicit feedback from all teams and compile a Lessons Learned document immediately after the exercise
Exchange Lessons Learned with Sungard after each exercise
Keep the equipment reservations up to date
Benefits To Date
Well documented recovery procedures at Sungard for both CSG Data Centers
Fully trained Sungard engineers ready to take on server builds for either CSG Data Center at time of disaster (ATOD)
ATOD workload balance between CSG System Administrators and Sungard Engineers
Good understanding of change management between and during exercises
Better understanding of ATOD recovery timeline
[Diagram labels: Recover My Environment; Manage My Recovery; Protect My Data; Managed Recovery Program; Customer Managed Recovery; Sungard AS Recovery-As-A-Service; Physical Infrastructure; Virtual Infrastructure; Network, WG, Other; Reduce Costs & Risks; Improve Data Protection; Improve RTO; Reduce Tape Usage]
Move data from customer’s location to a recovery center
“Right sizing” solutions with a Tiered Availability by SLA approach
© 2014 Sungard Availability Services, all rights reserved
What is driving the need for specialized Recovery Services?
IT DR Layers and Common Challenges
DR Program
• Staffing not focused on DR
• Poor run books
• Lack of testing
System Recovery
• Not meeting RTOs
• Complex interdependencies
• CAPEX constraints
Data Protection
• Not meeting RPOs
• Backup windows too long
• Poor backup success rates
MRP Approach
Discover Production
» Infrastructure & Application Discovery
» Populate CMDB in Sungard Systems
» Baseline Scope for Recovery
» Understand Change Management Process
Assess & Design Recovery Strategy Recommendations
» Analyze Discovered Information & Apply Recovery Best Practices
» Design Recovery Solution Architecture
Define Recovery Plans & Procedures
» Define Core Recovery Configuration (e.g., DNS, AD)
» Define Application Recovery Configuration
» Define Application Recovery Plans & Procedures
Recovery Implementation & Execution
» Implement Recovery Solution (e.g., server / storage replication; setup infrastructure ATOT)
» Test Execution
» Test Management & Reporting
Recovery Lifecycle Management
» Analyze Production Changes for Impact on Recovery
» Update Recovery Design, Plans & Procedures
» Ongoing Recovery Optimization
[Cycle diagram labels: Discover, Assess, Define, Implement & Test, RLCM; Discover, Design, Run, Manage]
MRP Benefits
Program Kept in Constant State of Readiness
Refined & clearly documented run books/procedures
Recovery environments kept in sync with production by integrating DR readiness into daily change control
Enablement of tiered recoverability
Measurable results & continuous improvement
Implementation of DR best practices & automation tools
24/7/365 state of readiness
SLA-backed, IT availability solution
Focused, Experienced & Available Staff
DR is sole focus & core competency
Expert global staffing model executes at time of test/at time of disaster
Subject matter expertise across all recovery disciplines, system platforms & backup technologies
Over 35 years & 3,300 disasters supported & executed successfully
Optimal Spend on Risk Mitigation
Protect investments made in production IT (both technology & people)
Production staff focused on revenue generating activities
OPEX alternative to CAPEX investments
Define best alternatives for tiered recoverability
Ensure the appropriate levels of spending on risk mitigation
Better Managed Complexity & Reduced Risk
Factual baseline of your production environment
Identification of critical business processes, application interdependencies & underlying infrastructure
Validation of data protection & recoverability state
Repurpose IT staff to focus on event mitigation & data synchronization
Ensure application-level recovery at agreed upon service levels
In Summary…
To get the most out of MRP, communicate often with your Sungard Account Team and stay in touch with changes in your environment.
Thank you!
Contact Information
Cynthia L. Jenkins
Lead BCM Analyst
1‐402‐431‐7401
www.csgi.com
Additional Exercise Planning Information
This information is used during the planning and execution of CSG’s BC/DR Exercises
Tools for Planning an Exercise
One Place to Find All Exercise Related Information
• Unique SharePoint site for each exercise
• All exercise information is found there
Weekly In‐house Status Meetings
• Start the meeting by stating how many days until the exercise
• Keeps resources focused on their tasks
• Everyone informed of progress and issues
• Keeps teams aware of exercise timeline
• Follow‐up with teams not represented
• Sungard SDM and RSA are present
• Other Sungard resources invited as exercise gets closer
Tools for Developing Recovery Timeline
Tools Continued
• Detailed Project Plan
- Created using Microsoft Project
- Details the execution from exercise beginning to end (72 hours)
- Shows dependencies between servers and products
- Shows priority of server builds and restores
• Status Tracker
- Excel Spreadsheet using output of detailed project plan for planned start and completion time of the phases of recovery
- Lists every server in exercise and tracks completion time for each phase
- Updated by exercise monitors
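The Status Tracker described above compares planned and actual completion times per server and phase. A minimal sketch of that comparison follows; it is not the Excel workbook itself, and the class name, phase names, server names, and timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PhaseStatus:
    server: str
    phase: str                            # e.g. "build" or "restore" (assumed names)
    planned_end: datetime                 # taken from the detailed project plan
    actual_end: Optional[datetime] = None  # None until a monitor records completion


def late_items(rows, now):
    """Rows that finished late, or are still open past their planned end."""
    return [r for r in rows if (r.actual_end or now) > r.planned_end]


def percent_complete(rows):
    """Share of tracked phases that have a recorded completion time."""
    done = sum(1 for r in rows if r.actual_end is not None)
    return 100.0 * done / len(rows) if rows else 0.0


rows = [  # made-up tracker entries
    PhaseStatus("srv-a", "build", datetime(2014, 8, 4, 12), datetime(2014, 8, 4, 11)),
    PhaseStatus("srv-b", "build", datetime(2014, 8, 4, 12), datetime(2014, 8, 4, 14)),
    PhaseStatus("srv-c", "build", datetime(2014, 8, 4, 18)),
    PhaseStatus("srv-d", "restore", datetime(2014, 8, 4, 13)),
]
now = datetime(2014, 8, 4, 15)
print([r.server for r in late_items(rows, now)], percent_complete(rows))
```

The same two figures, late items and percent complete, are the kind of summary that could feed a periodic progress report.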
Tools for Managing an Exercise
Tools for Managing the Exercises
• Collaboration – Screen Sharing and Conference Bridges
- Screen sharing and conferencing tools defined before exercise
- Publish the phone numbers and URLs in multiple places
- Main Conference Bridge – open for the entire exercise
- Secondary Bridges – used to work on and resolve specific issues
• Send out meeting invitations with pertinent information for entire exercise to all known participants
- Include the main bridge number
- Send secondary bridge information invitations to monitoring staff
• Instant Messaging
- Used for one‐off conversations
- Keeps conference bridge chatter down
- Use small group IM sessions for targeted conversations on specific issues
Tools for Monitoring the Exercise
Monitoring Exercise Progress – CSG uses the following strategy:
• On‐site team to interface with Sungard and publish exercise status
• 3 other monitors on conference bridge
- 2 monitors to track server status changes and facilitate turnover
- 1 monitor to record and track issues
• Hold training classes for monitors so they know what is expected of them
Issue Tracking – CSG uses Remedy
• Develop templates to prepopulate as many fields as possible
• Develop dropdown lists to make quick selections
• Encourage exercise participants to put technical information in IM windows so details are not lost and can be easily copied into tickets
Tools for Communicating with Executive Management
Go/No Go Report
• Used to gain Executive Management approval
• Shows high level timeline
• Documents readiness for all teams
• Lists all possible change collisions
• Lists exercise risks and mitigation strategies
• States all internal team and external client communications and dates sent
6 Hour Progress Reports
• Predefined email list
• Recaps progress since last report
• Lists significant issues resolved since last report
• Lists significant issues since last report
• Shows graphs of server recovery progress