Mendosus A SAN-Based Fault Injection Test-Bed for Construction of Highly Available Network Services...

Post on 19-Dec-2015


Mendosus: A SAN-Based Fault Injection Test-Bed for Construction of Highly Available Network Services

Xiaoyan Li, Richard Martin, Kiran Nagaraja, Thu D. Nguyen and Bin Zhang

Dept. of Computer Science, Rutgers University

http://www.panic-lab.rutgers.edu

Talk Outline

Motivation
Design
Implementation
Benchmarks
Case Studies
Related Work
Future Work

Motivation

Ubiquitous network access is driving exponential growth in network services

Availability is one key challenge
  Networked systems are comprised of large numbers of heterogeneous components
  Faults are not uncommon
  Complex interaction between components

Examples of costly failures: eBay, Britannica

Currently difficult to assess service availability
  How to analyze impact of failures?
  How to set up an appropriate test-bed?

Mendosus

Goal: provide infrastructure for service designers to assess the availability of network services

Overview:
  Provide flexible infrastructure to accurately model a variety of different networking systems from the application's point of view
  Run application in real time and inject faults to assess application's behavior

Two key components:
  Real-time emulation of a variety of interconnects
  General fault injection infrastructure

Vision

Map available resources to emulated network

Design

Mendosus Architecture

[Figure: architecture diagram. Labeled elements: Applications, User Level, Kernel, Emulator Module (Routing, Fault Inclusion, Latency), Mendosus daemon, Central Controller, Network State, Events, Fast & Reliable SAN]

Design Decisions

Central controller
  Advantage: consistent network and fault information
  Disadvantage: limits scalability
    Not involved in network emulation, so should still scale well to targeted system sizes (thousands or tens of thousands of components)

Entire network state is maintained at each end node
  Advantage: performance
  Disadvantage: limits scalability
    Only maintain state for LAN

Emulation module embedded within kernel
  Advantage: no modifications to application code
  Disadvantage: more difficult to modify and extend

Functional Components

Topology Maintenance

Fault Injection

Emulation

Topology Maintenance

Specification: simple ns-2-like topology scripts
  Specify available resources

Central controller manages topology
  Initializes original topology on each node
  Consistent view

Real-time topology changes
  Specified as scripted events

Controller monitors network connectivity
  Detects partitions

Fault Injection

Every network component can have a fault profile
  Switches, hubs, NICs, links, end nodes

Fault specification: trace files or theoretical distributions
  Exponential, Weibull, constant

Simulate fail-stop components
  MTTR: constant or following a distribution
  E.g. unplugging, port shutdown
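A minimal sketch of how a fault profile like the one above could be turned into a schedule of fail-stop events. The function names, the `(distribution, parameters)` tuple format, and the time unit are assumptions for illustration, not the Mendosus specification format.

```python
import random

def sample_time(rng, dist, **params):
    """Draw one duration (seconds) from the named distribution."""
    if dist == "constant":
        return float(params["value"])
    if dist == "exponential":
        # expovariate takes a rate, so convert from the mean
        return rng.expovariate(1.0 / params["mean"])
    if dist == "weibull":
        # weibullvariate(alpha, beta): alpha is scale, beta is shape
        return rng.weibullvariate(params["scale"], params["shape"])
    raise ValueError(f"unknown distribution: {dist}")

def fault_schedule(rng, horizon, ttf, ttr):
    """Return (fail_time, repair_time) pairs for one fail-stop component.

    ttf and ttr are hypothetical (dist, params) tuples describing
    time-to-failure and time-to-repair. Generation stops at `horizon`.
    """
    events, now = [], 0.0
    while True:
        now += sample_time(rng, ttf[0], **ttf[1])
        if now >= horizon:
            return events
        repaired = now + sample_time(rng, ttr[0], **ttr[1])
        events.append((now, repaired))
        now = repaired
```

For example, `fault_schedule(random.Random(7), 3600.0, ("exponential", {"mean": 600.0}), ("constant", {"value": 30}))` would model a link that fails on average every ten minutes and takes 30 seconds to repair.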

Emulation

Completely distributed
  Every node has enough network state

Emulation messaging sequence
  Application initiates communication
  Routing: determine route
  Fault Inclusion: effect of injected faults
  Latency: corresponding to route taken

We do not implement the internals of network components (e.g. switching)
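The messaging sequence above can be sketched as a small per-message function. This is only an illustration: the names are hypothetical, and treating any failed component on the route as a dropped message is an assumption based on the fail-stop model.

```python
def emulate_send(route, failed, latency_us):
    """Return the emulated one-way delay in microseconds, or None if dropped.

    route:      ordered component names the message traverses (routing stage)
    failed:     set of components currently marked faulty (fault inclusion)
    latency_us: per-component latency contribution (latency stage)
    """
    # Fault inclusion: a fail-stop component anywhere on the route
    # means the message never arrives (assumed behavior).
    if any(component in failed for component in route):
        return None
    # Latency: sum the contribution of every component on the route.
    return sum(latency_us[component] for component in route)
```

Because each node holds the full network state, a function like this could run locally on the sender without consulting the controller.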

Implementation

Ethernet LAN Emulation

Routing
  Emulate computation of Ethernet spanning tree
    Controller chooses root of tree
    Emulator on each node computes identical spanning tree
  Reconfiguration performed periodically (every 2 secs)

Broadcast & Multicast
  Emulate using sequence of unicast
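A sketch of how every node could derive the same tree from the same topology and root. It uses a breadth-first search with sorted neighbor order as a stand-in for the real spanning tree protocol (which compares bridge IDs and path costs); the function names and topology format are assumptions.

```python
from collections import deque

def spanning_tree(links, root):
    """Return {node: parent} for a BFS tree rooted at `root`.

    Neighbors are visited in sorted order, so the result depends only on
    the set of links and the root, not on the order links are listed in.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

def tree_route(parent, src, dst):
    """Return the path from src to dst that uses only tree edges."""
    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path
    up, down = path_to_root(src), path_to_root(dst)
    # Trim shared ancestors until both paths end at the lowest common one.
    while len(up) > 1 and len(down) > 1 and up[-2] == down[-2]:
        up.pop()
        down.pop()
    return up + list(reversed(down[:-1]))
```

With this in place, a software broadcast reduces to calling `tree_route` once per destination and sending one unicast along each path.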

Ethernet LAN Emulation - Faults

Network partitions
  Controller monitors connectivity
  Multiple roots: one for each partition

NIC fail-over
  Multiple interfaces using IP aliasing support in Linux
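A sketch of the partition handling described above: find the connected components of the live topology and pick one root per component. Choosing the alphabetically smallest node as root is an arbitrary assumption for illustration; the slides do not say how the controller picks roots.

```python
def partitions(nodes, links, failed_nodes=frozenset(), failed_links=frozenset()):
    """Group live nodes into connected components over live links.

    Returns a list of sorted node lists, one per partition.
    """
    live = sorted(n for n in nodes if n not in failed_nodes)
    adjacency = {n: set() for n in live}
    for a, b in links:
        if (a, b) in failed_links or (b, a) in failed_links:
            continue
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for start in live:
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from `start`.
        group, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups

def choose_roots(groups):
    """Pick one spanning-tree root per partition (assumed rule: smallest name)."""
    return [group[0] for group in groups]
```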

Emulation completeness…

Feature | Ethernet | Emulated Ethernet
P-to-P | Yes | Yes
Broadcast | Hardware | Software (multiple unicast)
Multicast | Hardware | Software (broadcast w/ filters)
Layer 3, 4 services (e.g. VLAN, IGMP) | Some advanced switches | Not implemented

Micro-benchmarks

Emulation Limits

Network | No. of Switches in Topology | Throughput (MB/sec) | RTT (usec)
Fast Ethernet | (not given) | 11.81 | 88.9
Gigabit Ethernet | (not given) | 66.00 | 130.0
Emulator | (not given) | 79.18 | 54.8
Emulator | (not given) | 79.61 | 53.4

Software Broadcast Scaling

Fault View Convergence

Case Studies

Group Membership

Test protocol behavior under faults
  Subtle interactions in distributed protocols

Three Round Membership algorithm
  Robust against multiple node failures, packet drops and network partitions
  Two modes of operation: normal and FCM

Membership Observations

[Figure: nodes A, B, C and D with link L, and a timeline of injected events 1 to 5]

1. NIC failure at B
2. Link L down
3. NIC at B recovers
4. Packet drops at A
5. Link L up

Multi-Level Switched Network

Large enterprise LANs have multiple layers of network components
  Access, core and aggregation switches

How to evaluate availability vs. cost vs. complexity?

Study service availability with increased redundancy
  Faults following exponential distributions
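As an analytic point of comparison for this kind of study, the textbook steady-state formulas can be sketched as below. They assume independent failures and perfect fail-over, which a real switched network may not satisfy; they are not the measurements reported in the talk, and the function names are hypothetical.

```python
def availability(mttf, mttr):
    """Steady-state availability of one component: uptime / total time."""
    return mttf / (mttf + mttr)

def redundant_availability(a, n):
    """Availability of n independent replicas where any one suffices."""
    return 1.0 - (1.0 - a) ** n
```

Under these assumptions, a switch that is up 99% of the time reaches roughly 99.99% when duplicated, which is the kind of idealized figure an emulated run with injected faults can be checked against.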

Enterprise LAN

Availability Vs Redundancy

Related Work

Network emulation
  Distributed emulation: Emulab [Utah], DelayLine
  Centralized emulation: NISTNET, Lancaster emulator

Fault injection
  Script-based probing and fault injection: Orchestra, DOCTOR
  Correlated faults: Loki [UIUC]

Simulation
  NS-2, REAL [Cornell], SSFNet, x-sim [Arizona]

Future Work

Extend Mendosus to emulate other networks
  WAN: build in performance dynamics model
  Wireless LAN: realistic fault and performance models

Support pluggable modules within network components which add functionality and additional failures
  Intelligent routing protocols (e.g. HSRP)
  Dynamic DNS, RR DNS

Summary

Test-bed for service designers to systematically analyze network and protocol design against failures

Results show that real-time emulation is feasible given the capability of current SANs

Demonstrated the flexibility and usefulness of Mendosus through two case studies

Another step towards building highly available services…