BEST PRACTICES FOR RELIABLE CARRIER GRADE TELEPHONY Alistair Cunningham, Integrics Ltd.
Reliability
• Think people and culture, not technology.
• Complexity is the enemy.
• Discipline is the answer.
• Management must be willing to sacrifice features.
• Reliability for all customers is more important than winning one new customer.
Staff Responsibilities
• Assign a senior engineer as system manager.
• System manager has ultimate responsibility for whole system.
• Can delegate tasks to others.
Cluster Architecture
• Duplicate all important functions. Use heartbeat, DRBD/GFS, application level load balancing. Remember utilities.
• Consistency between machines is vital.
• Virtual machines have more outages.
• Monitor all machines, services, and resources.
• Daily and monthly backups.
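As an illustration of the consistency point, here is a minimal sketch of a check that compares an inventory collected from each machine and reports every item that differs or is missing. The function name, the inventory layout, and the host and package names in the usage note are assumptions made for this example; how the inventory is gathered is left open.

```python
from collections import defaultdict


def find_inconsistencies(inventory):
    """Report items whose value is not identical on every machine.

    inventory maps host name -> {item name: value}, for example package
    versions or configuration file checksums (hypothetical layout).
    Returns {item: {value: [hosts]}}; hosts lacking the item are listed
    under the key None.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for host, items in inventory.items():
        for item, value in items.items():
            seen[item][value].append(host)

    all_hosts = set(inventory)
    report = {}
    for item, by_value in seen.items():
        hosts_with_item = {h for hosts in by_value.values() for h in hosts}
        missing = sorted(all_hosts - hosts_with_item)
        # An item is consistent only if every host has it with one value.
        if len(by_value) > 1 or missing:
            entry = {value: sorted(hosts) for value, hosts in by_value.items()}
            if missing:
                entry[None] = missing
            report[item] = entry
    return report
```

For example, with hypothetical hosts `sip1` and `sip2` reporting different versions of the same package, the package appears in the report with each version mapped to the hosts that run it. An empty report means the machines agree on everything that was collected.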
Upgrades and Changes
• Risk is unpredictable and cumulative.
• Many small changes are riskier than a few large changes.
• Test all changes on a staging machine first.
• Keep records of changes.
• Consider a change management system.
• Keep customizations to a minimum.
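The staging and record-keeping points can be combined in a small sketch: a change log that refuses any change not marked as tested on staging and records when each accepted change was applied. `ChangeRecord` and `ChangeLog` are hypothetical names invented for this example, not part of any particular change management system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChangeRecord:
    """One change to the production system (hypothetical structure)."""
    summary: str
    author: str
    tested_on_staging: bool = False
    applied_at: Optional[datetime] = None


class ChangeLog:
    """In-memory record of changes applied to production."""

    def __init__(self):
        self.records = []

    def apply_to_production(self, change, now=None):
        # Refuse anything that has not been through staging first.
        if not change.tested_on_staging:
            raise ValueError(f"untested change refused: {change.summary}")
        change.applied_at = now or datetime.now(timezone.utc)
        self.records.append(change)
        return change
```

A real deployment would persist the records somewhere durable; the in-memory list only keeps the sketch short.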
Dealing with Vendors
• Vendors can never substitute for system manager.
• Give vendors access to staging machines but not production.
• Your staff must have debugging skills.
• Subscribe to security mailing lists.
Causes of Outages
Most outages are caused by one of:
• Untested changes – use staging.
• Hard disks filling up – use monitoring.
• Power and network outages – redundancy or split cluster.
Avoiding these three is usually sufficient to achieve good reliability.
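For the disk-space cause, a monitoring check can be very small. The sketch below uses Python's standard `shutil.disk_usage`; the function name `disk_alerts`, the 90% threshold, and the paths in the usage note are assumptions chosen for illustration, and the `usage` parameter exists only so the check can be exercised without a real filesystem.

```python
import shutil


def disk_alerts(paths, threshold=0.9, usage=shutil.disk_usage):
    """Return (path, fraction_used) for each path at or above threshold."""
    alerts = []
    for path in paths:
        stats = usage(path)
        if stats.total == 0:
            # Some pseudo-filesystems report zero size; nothing to check.
            continue
        fraction = stats.used / stats.total
        if fraction >= threshold:
            alerts.append((path, fraction))
    return alerts
```

Run from a scheduler against paths such as `/` and `/var`, with any non-empty result forwarded to whatever alerting is already in place.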
Questions?