
ABC Co. Network Implementation

• High reliability is primary concern
– Near 100% uptime required
– Customer SLA has stiff penalty clauses
– Everything is designed in a redundant fashion
– Network redundancy not integrated with system design or application design
– Application and system design not integrated
– Management added last (to fix problems)

The challenge is always politics

• Politics prevents different parts of the company from working together.
– Networking, Systems, and Applications are three different groups.
– The Systems group owns the management issues.
– Some requirements get in the way:
  • e.g. the management station must keep its data on the database server.

Network design

• “Dual Everything” is the design rule
– Dual routers/hubs (Cisco 5500s)
– Dual Ethernet
– Dual-attached systems

A simple picture

[Figure labels: Rtr/Hub, Rtr/Hub; Redundant net to customers; Dual rail Ethernet; Server a; TNG, DNS, WINS; Server n]

More detail

• No actual “Ethernet bus”
– Systems connect to the 5500s via UTP
– Each system connects to both 5500s
  • One connection is to the “primary” LAN, the other to the secondary LAN
  • Half have the “left” 5500 as primary, the other half have the “right” as primary

• 5500s run OSPF and “router cluster” software

Problems...

• Server operating systems (NT and Unix) do not switch away from the primary interface when it fails; they keep trying to use it, so applications hang and connections time out.

• DNS points only to one interface on each server.

• No automatic failover built into applications.

Management software must:

• Detect NIC failures

• Continue to monitor system agents in presence of network failures

• Correct server routing tables if primary interface fails (or the hub fails)

• Update DNS

• Notify operations as required.

Challenges

• Get each system to report all status via both NICs.

• Monitor system over both NICs.

• Prevent duplicate notifications.

• Fail over as fast as possible.

• Show connectivity of each system to both networks.

What needs to be done to do this?

• Modify the auto discovery scripts to add each system twice, as two independent systems.
– Requires a private host file for name/address translation (cannot depend on access to DNS)

• Invent a mechanism to recognize which interface is “active” and block messages from the other NIC(s)
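The double registration could be sketched as below. This is only an illustration in Python, not the discovery script that was actually modified: the function names and the sample addresses are hypothetical, and reusing the -p/-s name suffix for the host-file aliases is an assumption.

```python
# Sketch only: build a private host file in which every dual-attached
# server appears twice, once per interface. All names are hypothetical.

def host_entries(name, primary_ip, secondary_ip):
    """Return two independent (alias, address) entries for one server."""
    return [
        (f"{name}-p", primary_ip),    # interface on the primary LAN
        (f"{name}-s", secondary_ip),  # interface on the secondary LAN
    ]


def render_hosts(servers):
    """Render {name: (primary_ip, secondary_ip)} as host-file text."""
    lines = []
    for name, (primary_ip, secondary_ip) in sorted(servers.items()):
        for alias, ip in host_entries(name, primary_ip, secondary_ip):
            lines.append(f"{ip}\t{alias}")
    return "\n".join(lines)


def resolve(hosts_text, alias):
    """Look an alias up in the private host text, without touching DNS."""
    for line in hosts_text.splitlines():
        ip, host = line.split()
        if host == alias:
            return ip
    return None
```

Because each interface gets its own alias, the management station can address and monitor either NIC even when DNS is unreachable.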

More work...

• Duplicate onto the local system any Object Repository information needed to manage failover (cannot trust access to the SQL server)

• Store current connectivity state for all servers (added ILPs to class definitions).

Tricks used

• Each system name in messages has a code appended to indicate the interface address (-p or -s)

• Most of the work is done in event message processing.
– Each “raw” message is suppressed and a script is invoked to process it.
– Ping successes/failures are used to switch state
– Agent messages are dropped based on state and the p/s flag

Basic set of flows

• For each event (other than pings):
– If the mode is P or S (kept in the NT Registry) and the message is from the other interface (S or P, respectively), discard it.
– Else, reformat the message with the real server name, improve the content (system class, etc.) and send it back to the event console as a new message
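A minimal Python sketch of this filter, under stated assumptions: a plain dictionary stands in for the mode kept in the NT Registry, the output format is invented for illustration, and messages without a -p/-s tag are passed through (the slides do not say how those are treated).

```python
# Sketch only: drop agent messages that arrive via the inactive interface
# and rewrite the rest. Function names and message format are hypothetical.

def split_name(tagged):
    """Split 'server-p' into ('server', 'P'); side is None if untagged."""
    for suffix, side in (("-p", "P"), ("-s", "S")):
        if tagged.endswith(suffix):
            return tagged[: -len(suffix)], side
    return tagged, None


def process_event(tagged_name, text, modes, system_class="unknown"):
    """Return the reformatted message, or None if it should be discarded.

    modes maps real server name -> "P" or "S" (stand-in for the Registry).
    """
    name, side = split_name(tagged_name)
    mode = modes.get(name, "P")  # assumption: default to primary
    if side is not None and side != mode:
        return None  # message came from the interface that is not active
    return f"{name} [{system_class}]: {text}"
```

For example, with `modes = {"srv1": "P"}`, a message tagged `srv1-s` is discarded while the same message tagged `srv1-p` is re-sent under the real name `srv1`, which is also what keeps the console from showing each event twice.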

More Flow

• For each ping success/failure reported:
– Remember that the DSM has already done the retries
– On a failure, check whether the other port fails, too. If the other port is also dead, declare the node down and reset the state to primary.
– If the failed port is the primary, fail over to the secondary. If it is the secondary, fail back to the primary.
– Update DNS in all cases.
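The ping handling can be read as a small state machine. The Python sketch below is one possible reading, not the original REXX scripts: the `Node` class and `update_dns` callback are hypothetical, the "other port" check uses the last known ping result rather than a fresh probe, and the behavior on a ping success (clear the down flag, leave the mode alone) is an assumption because the slides do not spell it out.

```python
# Sketch only: per-node failover state driven by ping results.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mode: str = "P"  # interface currently in use: "P" or "S"
    alive: dict = field(default_factory=lambda: {"P": True, "S": True})
    down: bool = False


def handle_ping(node, side, success, update_dns):
    """Apply one ping result for interface `side` and return the new mode."""
    node.alive[side] = success
    other = "S" if side == "P" else "P"
    if success:
        node.down = False          # assumption: any success clears "down"
    elif not node.alive[other]:
        node.down = True           # both ports dead: node is down
        node.mode = "P"            # reset state to primary
    elif side == "P":
        node.mode = "S"            # primary failed: fail over
    else:
        node.mode = "P"            # secondary failed: back to primary
    update_dns(node.name, node.mode)  # DNS is updated in all cases
    return node.mode
```

No retry logic appears here because the DSM has already retried before reporting the failure.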

Router / Hub failure

• If the router/hub fails, invoke the primary failover script for each node connected to the primary side, and the secondary failover script for each node connected to the secondary side.
– This is effectively all the nodes, so we don’t have to wait for each to have a ping failure. The system will stabilize faster.
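As a rough Python illustration of this bulk failover: the script names, the `run_script` callback, and the mapping of each node to the hub that carries its primary LAN are all hypothetical stand-ins for whatever the real scripts used.

```python
# Sketch only: on a router/hub failure, run the matching failover script
# for every node at once instead of waiting for per-node ping failures.

def on_hub_failure(failed_hub, primary_hub_of, run_script):
    """Invoke a failover script for every node attached to the failed hub.

    primary_hub_of maps node name -> hub carrying that node's primary LAN.
    Every node is attached to both hubs, so every node is affected.
    """
    invoked = []
    for node, primary_hub in sorted(primary_hub_of.items()):
        if primary_hub == failed_hub:
            script = "primary_failover"    # node lost its primary side
        else:
            script = "secondary_failover"  # node lost its secondary side
        run_script(script, node)
        invoked.append((node, script))
    return invoked
```

With `{"a": "left", "b": "right"}` and a failure of the left hub, node `a` gets the primary failover script and node `b` the secondary one.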

Does it work?

• You bet! It required:
– Some special REXX scripts for failover
– A few Basic programs
– A hack to the auto discovery scripts
– Some magic with Trix and a few more Basic programs
