Mendosus: A SAN-Based Fault Injection Test-Bed for the
Construction of Highly Available Network Services
Xiaoyan Li, Richard Martin, Kiran Nagaraja,
Thu D. Nguyen and Bin Zhang
Dept. of Computer Science, Rutgers University
http://www.panic-lab.rutgers.edu
Talk Outline
  Motivation
  Design
  Implementation
  Benchmarks
  Case Studies
  Related Work
  Future Work
Motivation
Ubiquitous network access leads to exponential growth in network services
Availability is one key challenge
  Networked systems comprise large numbers of heterogeneous components
  Faults are not uncommon
  Complex interactions between components
Examples of costly failures: eBay, Britannica
Currently difficult to assess service availability
  How to analyze the impact of failures?
  How to set up an appropriate test-bed?
Mendosus
Goal: provide infrastructure for service designers to assess the availability of network services
Overview:
  Provide flexible infrastructure to accurately model a variety of networking systems from the application's point of view
  Run the application in real time and inject faults to assess its behavior
Two key components:
  Real-time emulation of a variety of interconnects
  General fault injection infrastructure
Vision
Map available resources to emulated network
Design
Mendosus Architecture
[Figure: Mendosus architecture. Labels: User Level, Kernel, Applications, Mendosus daemon, Central Controller, Emulator Module (Routing, Fault Inclusion, Latency), Network State, Events, Fast & Reliable SAN]
Design Decisions
Central controller
  Advantage: consistent network and fault information
  Disadvantage: limits scalability
    Not involved in network emulation, so should still scale well to targeted system sizes (thousands or tens of thousands of components)
Entire network state is maintained at each end node
  Advantage: performance
  Disadvantage: limits scalability
    Only maintain state for the LAN
Emulation module embedded within the kernel
  Advantage: no modifications to application code
  Disadvantage: more difficult to modify and extend
Functional Components
Topology Maintenance
Fault Injection
Emulation
Topology Maintenance
Specification: simple ns-2-like topology scripts
  Specify available resources
Central controller manages topology
  Initializes original topology on each node
  Consistent view
Real-time topology changes
  Specified as scripted events
Controller monitors network connectivity
  Detects partitions
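
The idea of topology changes specified as scripted events can be sketched as follows. This is only an illustration: the link names, the `(time, action, link)` event format and the `replay` helper are assumptions made for the example, not the Mendosus script syntax.

```python
import heapq

# Hypothetical topology: a set of undirected links between named components.
links = {("n1", "sw1"), ("n2", "sw1"), ("sw1", "sw2"), ("n3", "sw2")}

# Hypothetical scripted events: (time in seconds, action, link).
script = [
    (5.0, "down", ("sw1", "sw2")),
    (12.0, "up", ("sw1", "sw2")),
    (8.0, "down", ("n2", "sw1")),
]

def replay(links, script):
    """Apply scripted events in time order; return the link set after each event."""
    current = set(links)          # work on a copy, leave the original topology intact
    queue = list(script)
    heapq.heapify(queue)          # order events by timestamp
    history = []
    while queue:
        when, action, link = heapq.heappop(queue)
        if action == "down":
            current.discard(link)
        else:
            current.add(link)
        history.append((when, frozenset(current)))
    return history

history = replay(links, script)
for when, state in history:
    print(when, sorted(state))
```

A controller along these lines would push each new link set to every node so that all emulators keep a consistent view.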
Fault Injection
Every network component can have a fault profile
  Switches, hubs, NICs, links, end nodes
Fault specification: trace files or theoretical distributions
  Exponential, Weibull, constant
Simulate fail-stop components
  MTTR: constant or following a distribution
  E.g. unplugging, port shutdown
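
A minimal sketch of how a fault profile could be turned into a fail-stop schedule, assuming a hypothetical profile format with a time-to-failure (`ttf`) and a time-to-repair (`ttr`) distribution. The dictionary keys and function names are invented for the example; only the three distributions come from the slide.

```python
import random

def sample_interval(dist, rng):
    """Draw one duration (seconds) from a hypothetical distribution spec."""
    kind = dist["kind"]
    if kind == "constant":
        return dist["value"]
    if kind == "exponential":
        return rng.expovariate(1.0 / dist["mean"])
    if kind == "weibull":
        # random.weibullvariate takes (scale, shape)
        return rng.weibullvariate(dist["scale"], dist["shape"])
    raise ValueError(f"unknown distribution: {kind}")

def fault_schedule(profile, horizon, seed=0):
    """Return (fail_time, repair_time) pairs for one fail-stop component.

    Durations are assumed to be positive; the schedule stops at `horizon`.
    """
    rng = random.Random(seed)
    now, schedule = 0.0, []
    while True:
        now += sample_interval(profile["ttf"], rng)          # time to next failure
        if now >= horizon:
            return schedule
        repair = now + sample_interval(profile["ttr"], rng)  # down until repaired
        schedule.append((now, min(repair, horizon)))
        now = repair
```

For example, a NIC with exponentially distributed failures and a constant repair time would use `{"ttf": {"kind": "exponential", "mean": 3600}, "ttr": {"kind": "constant", "value": 30}}`.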
Emulation
Completely distributed
  Every node has enough network state
Emulation messaging sequence
  Application initiates communication
  Routing: determine route
  Fault Inclusion: effect of injected faults
  Latency: corresponding to route taken
We do not implement the innards of network components
  E.g. switching
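
The three emulation steps (routing, fault inclusion, latency) can be sketched per packet as below. The topology, the latency values and the function names are hypothetical, and breadth-first search stands in for whatever routing the emulated interconnect actually uses.

```python
from collections import deque

# Hypothetical topology: adjacency lists plus one-way link latency in microseconds.
ADJ = {"A": ["S1"], "B": ["S1"], "S1": ["A", "B", "S2"], "S2": ["S1", "C"], "C": ["S2"]}
LATENCY_US = {
    frozenset(("A", "S1")): 10,
    frozenset(("B", "S1")): 10,
    frozenset(("S1", "S2")): 25,
    frozenset(("S2", "C")): 10,
}

def route(src, dst):
    """Routing step: fewest-hop path by breadth-first search, or None."""
    parent, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in ADJ[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def emulate_send(src, dst, failed):
    """Return the emulated delay in microseconds, or None if the packet is dropped.

    `failed` holds failed node names and failed links (as frozensets of endpoints).
    """
    path = route(src, dst)                                   # 1. routing
    if path is None:
        return None
    hops = [frozenset(h) for h in zip(path, path[1:])]
    if any(n in failed for n in path) or any(h in failed for h in hops):
        return None                                          # 2. fault inclusion
    return sum(LATENCY_US[h] for h in hops)                  # 3. latency
```

Because every node holds the full network state, each sender can run this locally without consulting the controller.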
Implementation
Ethernet LAN Emulation
Routing
  Emulate computation of the Ethernet spanning tree
  Controller chooses root of tree
  Emulator on each node computes identical spanning tree
  Reconfiguration performed periodically (every 2 secs)
Broadcast & Multicast
  Emulate using a sequence of unicasts
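
One way for every node to arrive at an identical tree from a controller-chosen root is to make the computation deterministic. The sketch below uses a breadth-first tree with sorted neighbour order, which is a simplification and not the IEEE spanning tree protocol; the function names and the `send_unicast` callback are assumptions for the example.

```python
from collections import deque

def spanning_tree(adj, root):
    """Breadth-first tree from `root`, returned as a child -> parent map.

    Neighbours are visited in sorted order, so every node that runs this on
    the same topology and root obtains the same tree.
    """
    parent, queue = {root: None}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent

def broadcast(sender, hosts, send_unicast):
    """Emulate a broadcast as one unicast per destination host."""
    for host in sorted(hosts):
        if host != sender:
            send_unicast(sender, host)
```

Multicast could follow the same pattern with a filter on group membership applied before each unicast.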
Ethernet LAN Emulation - Faults
Network partitions
  Controller monitors connectivity
  Multiple roots: one for each partition
NIC fail-over
  Multiple interfaces using IP aliasing support in Linux
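
Partition detection with one root per partition can be sketched as a connected-components pass over the links that are currently up. The root selection rule used here (lowest-named switch) is an assumption for illustration; the slides do not say how the controller picks a root.

```python
def partitions(nodes, links):
    """Group nodes into connected components over the links that are up."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups

def choose_roots(groups, switches):
    """Pick one root per partition: the lowest-named switch, if there is one."""
    roots = []
    for group in groups:
        candidates = sorted(group & switches)
        if candidates:
            roots.append(candidates[0])
    return roots
```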
Emulation completeness…
Feature                                 Ethernet                  Emulated Ethernet
P-to-P                                  Yes                       Yes
Broadcast                               Hardware                  Software (multiple unicast)
Multicast                               Hardware                  Software (broadcast w/ filters)
Layer 3, 4 services (e.g. VLAN, IGMP)   Some advanced switches    Not implemented
Micro-benchmarks
Emulation Limits
Network             No. of Switches in Topology   Throughput (MB/sec)   RTT (usec)
Fast Ethernet                                     11.81                 88.9
Gigabit Ethernet                                  66.00                 130.0
Emulator                                          79.18                 54.8
Emulator                                          79.61                 53.4
Software Broadcast Scaling
Fault View Convergence
Case Studies
Group Membership
Test protocol behavior under faults
  Subtle interactions in distributed protocols
Three Round Membership algorithm
  Robust against multiple node failures, packet drops and network partitions
  Two modes of operation: normal and FCM
Membership Observations
[Figure: nodes A, B, C and D, with link L; timeline of injected events 1 to 5]
1. NIC failure at B
2. Link L down
3. NIC at B recovers
4. Packet drops at A
5. Link L up
Multi-Level Switched Network
Large enterprise LANs have multiple layers of network components
  Access, core and aggregation switches
How to evaluate availability vs. cost vs. complexity?
Study service availability with increased redundancy
  Faults following exponential distributions
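
As a back-of-the-envelope companion to the emulation study, the textbook steady-state model for independent components gives a feel for how redundancy affects availability. This is an analytical sketch, not the method used in the case study, and the MTTF and MTTR figures in the usage note are made up.

```python
def availability(mttf, mttr):
    """Steady-state availability of one component: fraction of time it is up."""
    return mttf / (mttf + mttr)

def redundant(avail, copies):
    """Availability of `copies` independent replicas where any one suffices."""
    return 1.0 - (1.0 - avail) ** copies

def series(*avails):
    """Availability of a chain of components that must all be up."""
    total = 1.0
    for a in avails:
        total *= a
    return total
```

For instance, with a hypothetical switch that has `availability(990, 10)`, a path through an access, an aggregation and a core switch is `series(a, a, a)`, and duplicating the core layer replaces the last term with `redundant(a, 2)`.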
Enterprise LAN
Availability Vs Redundancy
Related Work
Network emulation
  Distributed emulation: Emulab [Utah], DelayLine
  Centralized emulation: NISTNET, Lancaster emulator
Fault injection
  Script-based probing and fault injection: Orchestra, DOCTOR
  Correlated faults: Loki [UIUC]
Simulation
  NS-2, REAL [Cornell], SSFNet, x-sim [Arizona]
Future Work
Extend Mendosus to emulate other networks
  WAN: build in performance dynamics model
  Wireless LAN: realistic fault and performance models
Support pluggable modules within network components which add functionality and additional failures
  Intelligent routing protocols (e.g. HSRP)
  Dynamic DNS, RR DNS
Summary
Test-bed for service designers to systematically analyze network and protocol design against failures
Results show that real-time emulation is feasible given the capability of current SANs
Demonstrated the flexibility and usefulness of Mendosus through two case studies
Another step towards building highly available services…