Reliability and State Machines in an Advanced Network Testbed
-
Upload
bathsheba-leonard -
Category
Documents
-
view
22 -
download
0
description
Transcript of Reliability and State Machines in an Advanced Network Testbed
![Page 1: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/1.jpg)
Reliability and State Machines in an Advanced
Network Testbed
Mac Newbold
School of ComputingUniversity of UtahMS Thesis Defense
April 5, 2004Advisor: Prof. Jay Lepreau
![Page 2: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/2.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 2
Distributed Systems• Distributed Systems are complex
– Many components– Distributed across multiple systems
• Component failures are relatively common– But should not cause system breakdown
• “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” – Leslie Lamport, quoted in CACM, June 1992
![Page 3: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/3.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 3
Our Context: Emulab• Emulab is an advanced network testbed• Complex time- and space-shared system• System dynamically reconfigures nodes and
network links to create “experiments”• Key architectural feature: Central Database
– System uses DB for storage, communication
• Complex system with many different scripts and programs on clients and servers
![Page 4: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/4.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 4
Emulab Background
• First prototype in April 2000 (10 nodes)• In production since Oct. 2000 (40 nodes)• Early versions weren’t perfect
– Reliability problems– Experiments of limited size– Inefficient use of resources
• Problem is becoming harder– 200 nodes, 400 remote, 2000 virtual nodes
![Page 5: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/5.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 5
Four Key Challenges
Emulab requirements:• Reliability• Scalability• Performance and Efficiency• Generality and Portability
![Page 6: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/6.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 6
1. Reliability
• Complex systems are hard to make reliable
• Many sources of unreliability:– Hardware – commodity PCs as nodes– Software – misconfiguration, bugs, etc.– Humans – can interrupt at any time
• More complexity and more parts mean higher chance that something is broken at any given time
![Page 7: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/7.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 7
2. Scalability
• Almost everything a testbed provides is harder to provide at a larger scale
• Larger scale requires more resources• If throughput doesn’t increase, things
slow down at larger scale• Increased load adversely affects
reliability• Practical scalability limited by
reliability, performance and efficiency
![Page 8: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/8.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 8
3. Performance and Efficiency
• One direct requirement on performance:– Emulab is used in an interactive style
• System tasks must complete in “a few minutes”
• Indirect requirements:– Scalability requirement places high
demands on many system components– Maximize efficient resource utilization
• As many users/experiments as possible in the shortest time with the fewest resources
![Page 9: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/9.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 9
4. Generality and Portability
• Workload Generality– Wide variety of research, teaching,
development, and testing activities– Good support for minimally and non-
instrumented client OS’s and devices
• Model generality– Evolving system– New types of network devices
• Portable and non-intrusive client software
![Page 10: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/10.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 10
Summary of Challenges
• Three challenges are closely related– Fourth is a constraint more than a
challenge
• Reliability is key• Failure rates directly impact scalability,
performance, and efficiency• Generality and portability requirements
constrain any solution used to address the challenges
![Page 11: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/11.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 11
Thesis Statement
Enhancing Emulab with a flexible framework
based on state machines provides better monitoring and control, and improves thereliability, scalability, performance, efficiency, and generality of the system.
![Page 12: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/12.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 12
Outline
• Introduction, Background, and Challenges
• Thesis Statement• State Machines in Emulab
– Interactions, Control Models, stated– Node Boot State Machines
• Results• Related and Future Work• Conclusion
![Page 13: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/13.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 13
State Machines
• Also called Finite State Machines (FSMs), Finite State Automata (FSAs), or Statecharts in UML
• Well-known model• Simple• Explicit model• Rich and flexible• Easy to understand and visualize
![Page 14: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/14.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 14
Example State Machine• Three main parts:• States• Transitions
– Directional
• Events– Associated with
transitions – labels
• Stored in database– diagrams generated
automatically from DB
![Page 15: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/15.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 15
State Machines in Emulab
• Each state machine has a “type”– Currently three: node boot, node allocation,
and experiment status
• Multiple machines allowed within a type– Only in one state in one machine of a type
• States can have “timeouts” with actions– Timer starts when state is entered
• State “triggers” – Entry Actions
![Page 16: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/16.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 16
Direct Interaction
• Within a type, can take a transition from a state in one machine (or “mode”) to a state in another machine of that type– Known as “mode transition” in Emulab– Similar to hierarchical state machines
• Highlights similarities/symmetries– Most machines are variations of another
• Improved code reuse
![Page 17: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/17.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 17
Direct Interaction Example
![Page 18: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/18.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 18
Models of State Machine Control
• Centralized monitoring & control: stated – State changes submitted, checked for
correctness, applicable actions performed– Daemon tracks timeouts– Used for Node Boot state machines
• Distributed management of state machine– No central service enforcing correctness– No dependency on central service– Timeouts harder to implement
![Page 19: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/19.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 19
The stated State Daemon
• Listens for events continuously• State transitions cause database
updates• Invalid transitions cause notifications• Timeouts, timeout actions, triggers
configurable in DB for each state• Caching – only writer of node boot states• Modular design – dispatch events to
proper action handlers
![Page 20: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/20.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 20
Node Boot State Machines
• Nodes in Emulab self-configure• Monitored via state machines – stated
• “Normal” node boot machine• Variations – “Minimal” • Reloading node disks
![Page 21: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/21.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 21
Node Self-Configuration• Nodes send state events during booting to
allow progress to be monitored• “Global knowledge” inside state daemon
– Better decisions about recovery steps
• Finer granularity gives more information for recovery, allows for shorter timeouts
• Each OS image is associated with a “mode” (state machine) that describes its behavior
![Page 22: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/22.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 22
“Normal” Node Boot Machine
• Start in SHUTDOWN• DHCP, start OS booting• When Emulab-specific
configuration begins, enter TBSETUP
• ISUP when finished• In case of failure, can
retry from SHUTDOWN
![Page 23: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/23.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 23
Variations of Node Boot
• Example: MINIMAL• For OS images with
little or no special Emulab support
• ISUP generated by stated if necessary– Immediate or ping
• SilentReboot allowed in this mode
![Page 24: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/24.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 24
Reloading Node Disks
• Mode transition into RELOAD / SHUTDOWN
• RELOADDONE transitions into mode for newly-installed OS image
![Page 25: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/25.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 25
Reloading and Mode Transitions
![Page 26: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/26.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 26
Experiment Status State Machine
• Uses distributed model – Stored in database, but
not strictly enforced
• Documents life-cycle• Restricts user
interruption– Reduces a source of errors
• Can queue, activate, modify, restart, swap, or terminate an expt.
![Page 27: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/27.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 27
Node Allocation State Machine
• Distributed control model• Diagram documents the
way the states are used by the program, but not currently enforced
• Either reloads nodes with a custom image, or reboots them as members of the experiment
![Page 28: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/28.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 28
Results: Context
• Emulab in production 1 year before state machines were added
• In production 3 years since first stated– 650 users, 150 projects, 75 institutions
• 19 papers, top venues
– Over 155,000 nodes allocated in nearly 10,000 experiment instances
– 13 classes at 10 universities
• Emulab SW on 6 more testbeds, 4 planned
![Page 29: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/29.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 29
Results
• Anecdotal: (others in thesis) – Reliability/Performance: Preventing race
conditions– Generality: Graceful handling of custom
OS images– Generality: New node types
• Experiment:– Reliability/Scalability: Improved timeout
mechanisms
![Page 30: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/30.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 30
Reliability/Performance: Preventing Race Conditions
• Expt. ends, nodes move to holding expt., get reloaded, then freed while they boot
• Problem: Getting allocated while booting • Node appears unresponsive, gets forcefully
power cycled, corrupts FS on disk• Solution: don’t free immediately• Add trigger on next ISUP for a node that
finishes booting, that frees it when booted
![Page 31: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/31.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 31
Generality: Graceful Handling
of Custom OS Images• Users create custom OS images• Emulab client software is optional• Problem: Nodes don’t send state events• Solution: “Minimal” state machine• SHUTDOWN: maybe on server, optional• BOOTING: server side, trigger checks ISUP• ISUP: either node sends, or generated
when pingable, or generated immediately
![Page 32: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/32.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 32
Generality: New Node Types
• Emulab is always growing and changing• State machine model and our framework
are flexible to provide graceful evolution• We’ve added 5 new node types
– IXPs, wide-area, PlanetLab, vnodes, sim-nodes
• Mostly used existing machines• 2 new machines, slight variations• 1 change to stated to add a new trigger
![Page 33: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/33.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 33
Reliability/Scalability: Improved Timeout
Mechanisms• Before: reboot node, wait for it to boot– Static, 7 minute timeout
• Pragmatic – minimizes false positives/negatives• Avg. 4 min., but max. error-free boot is 15 min.• 11 minute delay is too long
• Improved: state machine monitoring– Fine-grained, context-sensitive timeouts– Faster error detection– Better monitoring and control
![Page 34: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/34.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 34
Reliability/Scalability:Improved Timeout
Mechanisms• Experiment: Measure expt. swap-in time, with and without the improvements– Synthetic but plausible scenario
• One node, loads an RPM (8 min. install)• Node reboots, timeout during RPM install
– Reboots again, timeout again, mark node dead
– Try twice per swap-in, 3 swap-in attempts– Total failure in 45 min., 3 nodes “dead”
![Page 35: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/35.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 35
Reliability/Scalability:Improved Timeout
Mechanisms• With state machines:– Timeouts: SHUTDOWN 2 min, BOOTING 3
min, TBSETUP 10 min
• Node reboots, enters BOOTING• 1 minute: Enters TBSETUP• 9 minutes: Enters ISUP, expt. ready• Succeeds, with no dead nodes or
retries• Cut time from 45 min. to 9 min. (80%)
![Page 36: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/36.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 36
Limitations and Issues
• stated is critical infrastructure– Another single point of failure– More system complexity, new bugs,
complicated debugging– Potential for scaling problems (none seen
yet)
• Simple heuristics for error detection– Send mail for invalid transitions
![Page 37: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/37.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 37
Summary of Results• Explicit model requires careful thought
– Improves design and implementation
• Visualization makes it easier to understand• Faster and more accurate error detection• Better reliability helps scalability/efficiency
– Bigger expts. possible, less overhead per expt.
• Flexibility for evolution, workload generality
![Page 38: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/38.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 38
Related Work
• “Standard” Finite State Automata – basics– Timed Automata – have global clock
• Message Sequence Charts (MSCs)– “Scenarios” – hierarchy, like
modes/machines
• UML Statecharts– States have entry actions – “triggers”– Hierarchical states – similar to modes– Can model Emulab’s timeouts
![Page 39: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/39.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 39
Future Work
• Further developing distributed control– Add monitoring, timeouts, triggers
• Better heuristics for error detection– Only flag clustered or related errors
• Implement more ideas from other systems– UML’s exit actions, guarded transitions, etc.
• Move code into database – i.e. triggers– Easier to modify, framework code vs. machine
![Page 40: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/40.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 40
Conclusion
Enhancing Emulab with a flexible framework
based on state machines provides better monitoring and control, and improves thereliability, scalability, performance, efficiency, and generality of the system.
![Page 41: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/41.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 41
Bonus Slides
![Page 42: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/42.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 42
Demonstrating Improvement• Currently: programs have their own retry
and timeout mechanisms for node reboots– No knowledge of progress, just completion– Can cause failures by forcing a reboot, which
can damage file systems on node disk• “New Way”: stated handles timeout and
retry during rebooting– Implemented, not installed– Knows if progress is being made– Programs simply wait for ISUP or failure event
![Page 43: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/43.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 43
Demonstrating Improvement (cont’d)
• These failures directly hurt reliability– Node failure can cause experiment setup
to fail
• Significant impact on scaling, performance, efficiency
• Maximum experiment size is limited by node failure rate– Failures make things take longer– A slower system means less efficient use
of resources
![Page 44: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/44.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 44
Demonstrating Improvement (cont’d)
• Compare current vs. new: failure rate, time to completion, etc.
• Test data; one of:– Historical experiments– Artificially high load– Fault injection, e.g. reboots
• Why new way should help:– Better knowledge for intelligent recovery
• Know when to wait longer and when to retry
– Shorter timeouts allow for early error detection
![Page 45: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/45.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 45
Future Work:Modeling Indirect Interaction
• Occur between machines of different types– Due to external relationships between the
entities tracked by each type of machine• Examples:
– Same entity may be tracked in two different types of machine• Nodes are in Boot and Allocation machines
– Other relationship between entities• Nodes may be “owned” or allocated to an
experiment – links Expt. Status and Node machines
![Page 46: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/46.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 46
Question and Answer
![Page 47: Reliability and State Machines in an Advanced Network Testbed](https://reader035.fdocuments.us/reader035/viewer/2022062719/5681306d550346895d964e88/html5/thumbnails/47.jpg)
April 5th, 2004 Mac Newbold - MS Thesis Defense 47
Conclusion
Enhancing Emulab with a flexible framework
based on state machines provides better monitoring and control, and improves thereliability, scalability, performance, efficiency, and generality of the system.