Internet Server Clusters
Jeff Chase
Duke University, Department of Computer Science
CPS 212: Distributed Information Systems
Using Clusters for Scalable Services
Clusters are a common vehicle for improving scalability and availability at a single service site in the network.
Are network services the “Killer App” for clusters?
• incremental scalability: just wheel in another box...
• excellent price/performance: high-end PCs are commodities (high volume, low margins)
• fault-tolerance: “simply a matter of software”
• high-speed cluster interconnects are on the market: SANs + Gigabit Ethernet...
cluster nodes can coordinate to serve requests w/ low latency
• “shared nothing”
[Fox/Brewer]: SNS, TACC, and All That
[Fox/Brewer97] proposes a cluster-based reusable software infrastructure for scalable network services (“SNS”), such as:
• TranSend: scalable, active proxy middleware for the Web
think of it as a dial-up ISP in a box, in use at Berkeley
distills/transforms pages based on user request profiles
• Inktomi/HotBot search engine
core technology for Inktomi Inc., today with $15B market cap.
“bringing parallel computing technology to the Internet”
Potential services are based on Transformation, Aggregation, Caching, and Customization (TACC), built above SNS.
TACC
Vision: deliver “the content you want” by viewing HTML content as a dynamic, mutable medium.
1. Transform Internet content according to:
• network and client needs/limitations
e.g., on-the-fly compression/distillation [ASPLOS96], packaging Web pages for PalmPilots, encryption, etc.
• directed by user profile database
2. Aggregate content from different back-end services or resources.
3. Cache content to reduce cost/latency of delivery.
4. Customize (see Transform)
TranSend Structure
[Figure: TranSend structure. Components: front ends (to Internet), profiles, control panel, cache partitions ($), and datatype-specific distillers (html, gif, jpg). Interconnect: SAN (high speed), utility network (10baseT), coordination bus.]
[adapted from Armando Fox (through http://ninja.cs.berkeley.edu/pubs)]
SNS/TACC Philosophy
1. Specify services by plugging generic programs into the TACC framework, and compose them as needed.
sort of like CGI with pipes
run by long-lived worker processes that serve request queues
allows multiple languages, etc.
2. Worker processes in the TACC framework are loosely coordinated, independent, and stateless.
ACID vs. BASE
serve independent requests from multiple users
narrow view of a “service”: one-shot read-only requests, and stale data is OK
3. Handle bursts with a designated overflow pool of machines.
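The “CGI with pipes” idea in point 1 can be sketched as a chain of stateless functions. This is only an illustration, not the TACC API: the worker names (`strip_comments`, `truncate`), the `compose` helper, and the `max_chars` profile key are all hypothetical.

```python
from functools import reduce
from typing import Callable, Dict

# A worker is a stateless function: (content, user_profile) -> content.
Worker = Callable[[str, Dict[str, str]], str]

def strip_comments(content: str, profile: Dict[str, str]) -> str:
    """Hypothetical transformer: drop lines that start with '#'."""
    return "\n".join(l for l in content.splitlines() if not l.startswith("#"))

def truncate(content: str, profile: Dict[str, str]) -> str:
    """Hypothetical distiller: cut content to the user's max_chars setting."""
    limit = int(profile.get("max_chars", "80"))
    return content[:limit]

def compose(*workers: Worker) -> Worker:
    """Chain workers like a Unix pipe; each sees the previous one's output."""
    def pipeline(content: str, profile: Dict[str, str]) -> str:
        return reduce(lambda acc, w: w(acc, profile), workers, content)
    return pipeline

# Because workers keep no state, any idle worker process could run any stage.
service = compose(strip_comments, truncate)
result = service("# header\nhello world", {"max_chars": "5"})
```

Since each worker depends only on its inputs, a failed worker can be restarted or replaced without any recovery of internal state.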
TACC Examples
HotBot search engine
• Query crawler’s DB
• Cache recent searches
• Customize UI/presentation
TranSend transformation proxy
• On-the-fly lossy compression of inline images (GIF, JPG, etc.)
• Cache original & transformed
• User specifies aggressiveness, “refinement” UI, etc.
[Figure: each example drawn as a composition of TACC workers (T, A, $, C), with DB and html endpoints.]
[Fox]
(Worker) Ignorance Is Bliss
What workers don’t need to know
• Data sources/sinks
• User customization (key/value pairs)
• Access to cache
• Communication with other workers by name
Common case: stateless workers
C, Perl, Java supported
• Recompilation often unnecessary
• Useful tasks possible in <10 lines of (buggy) Perl
[Fox]
Questions
1. What are the research contributions of the paper?
system architecture decouples SNS concerns from content
TACC programming model composes stateless worker modules
validation using two real services, with measurements
2. How is this different from clusters for parallel computing?
3. What are the barriers to scale in SNS/TACC?
4. How are requests distributed to caches, FEs, workers?
5. What can we learn from the quantitative results?
6. What about services that allow client requests to update shared data?
e.g., message boards, calendars, mail
SNS/TACC Functional Issues
1. What about fault-tolerance?
• Service restrictions allow simple, low-cost mechanisms.
Primary/backup process replication is not necessary with the BASE model and stateless workers.
• Uses a process-peer approach to restart failed processes.
Processes monitor each other’s health and restart if necessary.
Workers and manager find each other with “beacons” on well-known ports.
2. Load balancing?
• Manager gathers load info and distributes to front-ends.
• How are incoming requests distributed to front-ends?
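The manager/front-end split in point 2 can be sketched as below. The least-loaded policy and the class and method names (`Manager`, `FrontEnd`, `report`, `hints`, `pick_worker`) are assumptions made for illustration; the slides say only that the manager gathers load information and distributes it to the front ends.

```python
import random
from typing import Dict

class Manager:
    """Collects load reports (here: queue lengths) from workers."""
    def __init__(self) -> None:
        self.loads: Dict[str, int] = {}

    def report(self, worker: str, queue_len: int) -> None:
        self.loads[worker] = queue_len

    def hints(self) -> Dict[str, int]:
        # Snapshot sent to front ends; it may be stale by the time it is used.
        return dict(self.loads)

class FrontEnd:
    """Routes each request using the most recent hints from the manager."""
    def __init__(self) -> None:
        self.hints: Dict[str, int] = {}

    def update(self, hints: Dict[str, int]) -> None:
        self.hints = hints

    def pick_worker(self) -> str:
        if not self.hints:
            raise RuntimeError("no load hints received yet")
        # Assumed policy: least-loaded worker, ties broken at random.
        low = min(self.hints.values())
        return random.choice(sorted(w for w, q in self.hints.items() if q == low))

manager = Manager()
for name, qlen in [("w1", 3), ("w2", 1), ("w3", 5)]:
    manager.report(name, qlen)

fe = FrontEnd()
fe.update(manager.hints())
chosen = fe.pick_worker()
```

Because the hints are soft state, a front end that loses them can keep running with stale values until the next update arrives.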
Porcupine: A Highly Available Cluster-based Mail Service
Yasushi Saito
Brian Bershad
Hank Levy
University of Washington
Department of Computer Science and Engineering, Seattle, WA
http://porcupine.cs.washington.edu/
[Saito]
Why Email?
Mail is important: real demand
Mail is hard: write intensive, low locality
Mail is easy: well-defined API, large parallelism, weak consistency
[Saito]
How much of Porcupine is reusable by other services?
Can we use the SNS/TACC framework for this?
Goals
Use commodity hardware to build a large, scalable mail service
Three facets of scalability ...
Performance: Linear increase with cluster size
Manageability: React to changes automatically
Availability: Survive failures gracefully
[Saito]
Conventional Mail Solution
Static partitioning
Performance problems: no dynamic load balancing
Manageability problems: manual data partition decision
Availability problems: limited fault tolerance
[Figure: SMTP/IMAP/POP servers in front of NFS servers holding statically assigned mailboxes (Bob’s, Ann’s, Joe’s, and Suzy’s mbox).]
[Saito]
Key Techniques and Relationships
Framework: functional homogeneity (“any node can perform any task”)
Techniques: automatic reconfiguration, load balancing, replication
Goals: manageability, performance, availability
[Saito]
Porcupine Architecture
[Figure: nodes A, B, ..., Z connected by RPC. Each node runs the same components: SMTP server, POP server, IMAP server, load balancer, membership manager, replication manager, user map, mail map, mailbox storage, and user profile.]
[Saito]
Porcupine Operations
[Figure: a request flowing across nodes A, B, and C through four stages: protocol handling, user lookup, load balancing, and message store.]
[Saito]
Basic Data Structures
[Figure: three data structures across nodes A, B, and C. Applying a hash function to a user name (“bob”) indexes the user map (B C A C A B A C), a copy of which appears on every node. The mail map/user info holds each user’s fragment list: bob : {A,C}, ann : {B}, suzy : {A,C}, joe : {B}. Mailbox storage holds the mailbox fragments named by those lists (Bob’s, Suzy’s, Joe’s, and Ann’s messages).]
[Saito]
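A lookup through these structures can be sketched as follows. The bucket contents and fragment lists are taken from the slide; the choice of `zlib.crc32` as the hash function and the function names are assumptions for illustration only.

```python
import zlib
from typing import Dict, List, Set, Tuple

NODES = ["A", "B", "C"]

# User map: a fixed number of hash buckets, each assigned to a node.
# Every node holds a copy, so any node can find a user's manager locally.
USER_MAP: List[str] = ["B", "C", "A", "C", "A", "B", "A", "C"]

# Mail map: user -> fragment list (the set of nodes holding mailbox fragments).
# In Porcupine each entry lives on the user's manager node; one dict stands in here.
MAIL_MAP: Dict[str, Set[str]] = {
    "bob": {"A", "C"},
    "ann": {"B"},
    "suzy": {"A", "C"},
    "joe": {"B"},
}

def manager_of(user: str, user_map: List[str]) -> str:
    """Hash the user name into a bucket; the bucket names the manager node."""
    bucket = zlib.crc32(user.encode("utf-8")) % len(user_map)
    return user_map[bucket]

def locate(user: str) -> Tuple[str, Set[str]]:
    """Return (manager node, nodes that store the user's mailbox fragments)."""
    return manager_of(user, USER_MAP), MAIL_MAP.get(user, set())

manager, fragments = locate("bob")
```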
Porcupine Advantages
Advantages:
Optimal resource utilization
Automatic reconfiguration and task re-distribution upon node failure/recovery
Fine-grain load balancing
Results:
Better availability
Better manageability
Better performance
[Saito]
Availability
Goals:
Maintain function after failures
React quickly to changes regardless of cluster size
Graceful performance degradation / improvement
Strategy: two complementary mechanisms
Hard state: email messages, user profile → optimistic fine-grain replication
Soft state: user map, mail map → reconstruction after membership change
[Saito]
Soft-state Reconstruction
[Figure: timeline across nodes A, B, and C showing each node’s user map and mail map entries (bob : {A,C}, joe : {C}, suzy : {A,B}, ann : {B}) through two steps: 1. membership protocol and user map recomputation; 2. distributed disk scan.]
[Saito]
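The two reconstruction steps can be sketched as below. The reassignment rule (buckets of departed nodes are handed to survivors round-robin) is an assumption; the slides do not give Porcupine’s actual rule. Function names are hypothetical.

```python
from typing import Dict, List, Set

def recompute_user_map(user_map: List[str], live: List[str]) -> List[str]:
    """Step 1: after a membership change, reassign only the buckets whose
    owner left the cluster; surviving owners keep their buckets."""
    survivors = sorted(live)
    new_map = []
    for i, owner in enumerate(user_map):
        if owner in survivors:
            new_map.append(owner)
        else:
            # Assumed policy: spread orphaned buckets by bucket index.
            new_map.append(survivors[i % len(survivors)])
    return new_map

def rebuild_mail_map(disks: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Step 2: distributed disk scan. Each live node reports the users whose
    mailbox fragments it stores; the reports rebuild the fragment lists."""
    mail_map: Dict[str, Set[str]] = {}
    for node, users in disks.items():
        for user in users:
            mail_map.setdefault(user, set()).add(node)
    return mail_map

old_map = ["B", "C", "A", "B", "A", "B", "A", "C"]
new_map = recompute_user_map(old_map, live=["A", "B"])   # node C has failed
mail_map = rebuild_mail_map({"A": {"bob", "suzy"}, "B": {"suzy", "ann"}})
```

Nothing here is logged or replicated: the maps are soft state and can always be recomputed from the membership view and the fragments on disk.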
How does Porcupine React to Configuration Changes?
[Figure: throughput (messages/second, 300 to 700) versus time (0 to 800 seconds) for no failure and for one, three, and six node failures. Marked events: nodes fail, new membership determined, nodes recover, new membership determined.]
[Saito]
Hard-state Replication
Goals:
Keep serving hard state after failures
Handle unusual failure modes
Strategy: exploit Internet semantics
Optimistic, eventually consistent replication
Per-message, per-user-profile replication
Efficient during normal operation
Small window of inconsistency
[Saito]
How will Porcupine behave in a partition failure?
More on Porcupine Replication
To add/delete/modify a message:
• Find and update any replica of the mailbox fragment.
Do whatever it takes: make a new fragment if necessary... pick a new replica if the chosen replica does not respond.
• Replica asynchronously transmits updates to other fragment replicas.
continuous reconciling of replica states
• Log/force pending update state, and target nodes to receive update.
on recovery, continue transmitting updates where you left off
• Order updates by loosely synchronized physical clocks.
Clock skew should be less than the inter-arrival gap for a sequence of order-dependent requests... use node ID to break ties.
• How many node failures can Porcupine survive? What happens if nodes fail “forever”?
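The ordering rule (physical timestamp first, node ID as tie-breaker) can be sketched as a last-writer-wins replica. This is a simplification under stated assumptions: the `Replica` class and `apply` method are hypothetical, and the logging and retransmission steps are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Stamp = Tuple[float, str]   # (physical clock reading, node ID)

@dataclass
class Replica:
    # object id -> (stamp of the newest update seen, value)
    store: Dict[str, Tuple[Stamp, str]] = field(default_factory=dict)

    def apply(self, obj: str, stamp: Stamp, value: str) -> bool:
        """Apply an update only if it is newer than what the replica holds.
        Tuples compare timestamp first, then node ID, which breaks ties."""
        current = self.store.get(obj)
        if current is None or stamp > current[0]:
            self.store[obj] = (stamp, value)
            return True
        return False   # stale or duplicate update: ignore it

# Two updates with the same clock reading, issued at nodes A and B.
u1 = ("msg42", (10.0, "A"), "flag=read")
u2 = ("msg42", (10.0, "B"), "flag=deleted")

r1, r2 = Replica(), Replica()
r1.apply(*u1); r1.apply(*u2)   # r1 receives u1 then u2
r2.apply(*u2); r2.apply(*u1)   # r2 receives them in the opposite order
```

Both replicas end in the same state regardless of delivery order, which is what lets updates be pushed asynchronously and retried after recovery.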
How Efficient is Replication?
[Figure: throughput (messages/second, 0 to 800) versus cluster size (0 to 30 nodes). Porcupine with no replication: 68m/day. Porcupine with replication=2: 24m/day.]
[Saito]
How Efficient is Replication?
[Figure: throughput (messages/second, 0 to 800) versus cluster size (0 to 30 nodes). Porcupine with no replication: 68m/day. Porcupine with replication=2: 24m/day. Porcupine with replication=2, NVRAM: 33m/day.]
[Saito]
Load balancing: Deciding where to store messages
Goals:
Handle skewed workload well
Support hardware heterogeneity
No voodoo parameter tuning
Strategy: spread-based load balancing
Spread: soft limit on # of nodes per mailbox
Large spread → better load balance
Small spread → better affinity
Load balanced within spread
Use # of pending I/O requests as the load measure
[Saito]
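Spread-based placement can be sketched as below. The way the candidate set is padded (walking the node list from a hash of the user name) is an assumption, as are the function and parameter names; the slide states only that spread is a soft limit and that load is measured in pending I/O requests.

```python
import zlib
from typing import Dict, List, Set

def choose_store_node(user: str,
                      current_fragments: Set[str],
                      nodes: List[str],
                      pending_io: Dict[str, int],
                      spread: int) -> str:
    """Pick the node that should store the user's next message."""
    # Nodes that already hold a fragment are always candidates (affinity).
    # Spread is a soft limit: existing fragments beyond it still count.
    candidates = sorted(current_fragments)
    start = zlib.crc32(user.encode("utf-8")) % len(nodes)
    i = 0
    # Pad the candidate set up to `spread` nodes (assumed padding rule).
    while len(candidates) < spread and i < len(nodes):
        node = nodes[(start + i) % len(nodes)]
        if node not in candidates:
            candidates.append(node)
        i += 1
    # Balance load within the spread: fewest pending I/O requests wins.
    return min(candidates, key=lambda n: (pending_io[n], n))

nodes = ["A", "B", "C", "D"]
pending = {"A": 5, "B": 0, "C": 2, "D": 1}
small = choose_store_node("bob", {"A", "C"}, nodes, pending, spread=2)
large = choose_store_node("bob", {"A", "C"}, nodes, pending, spread=4)
```

With a small spread the message stays with the existing fragments even though an idle node exists; with a large spread it moves to the idle node, at the cost of one more fragment to track.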
Questions
• How to select the front-end node to handle the request? Does it matter which one we choose?
• Don’t we already know how to build big mail servers? (e.g., Earthlink, Christenson USITS97) Why do we need Porcupine?
• What properties of the mail “data model” allow this approach, with weaker consistency guarantees than a database?
• How does the system leverage/exploit the weaker semantics?
• Can the architecture accommodate new features, e.g., Pachyderm-like storage/indexing of large mail collections?
• Could I run Porcupine on the same cluster with other applications?
• Could this have been built on Microsoft’s MSCS? How much application effort would have been saved?
Clusters: A Broader View
MSCS (“Wolfpack”) is designed as basic infrastructure for commercial applications on clusters.
• “A cluster service is a package of fault-tolerance primitives.”
• Service handles startup, resource migration, failover, restart.
• But: apps may need to be “cluster-aware”.
Apps must participate in recovery of their internal state.
Use facilities for logging, checkpointing, replication, etc.
• Service and node OS support uniform naming and virtual environments.
Preserve continuity of access to migrated resources.
Preserve continuity of the environment for migrated resources.
Wolfpack: Resources
• The components of a cluster are nodes and resources.
Shared nothing: each resource is owned by exactly one node.
• Resources may be physical or logical.
Disks, servers, databases, mailbox fragments, IP addresses,...
• Resources have types, attributes, and expected behavior.
• (Logical) resources are aggregated in resource groups.
Each resource is assigned to at most one group.
• Some resources/groups depend on other resources/groups.
Admin-installed registry lists resources and dependency tree.
• Resources can fail.
cluster service/resource managers detect failures.
Fault-Tolerant Systems: The Big Picture
[Figure: layered stack of systems, each with its own redundancy techniques. Application services; database, mail service, and cluster service (replication, logging, checkpointing, voting); file/storage system (replication, RAID parity); messaging system (checksum, ack/retransmission); redundant hardware (parity, ECC).]
Note:
dependencies
redundancy at any/each/every level
what failure semantics to the level above?
Wolfpack: Resource Placement and Migration
The cluster service detects component failures and responds by restarting resources or migrating resource groups.
• Restart resource in place if possible...
• ...else find another appropriate node and migrate/restart.
Ideally, migration/restart/failover is transparent.
• Logical resources (processes) execute in virtual environments.
uniform name space for files, registry, OS objects (NT mods)
• Node physical clocks are loosely synchronized, with clock drift less than the minimal time for recovery/migration/restart.
guarantees migrated resource sees monotonically increasing clocks
• Route resource requests to the node hosting the resource.
• Is the failure visible to other resources that depend on the resource?
Membership 101
Cluster nodes must agree on the set of cluster members (the view).
• distribute resource ownership effectively
shift resources on node failures or additions
• eliminate dangerous/expensive interactions with faulty nodes
• “keep everyone in the loop” on updates and events
e.g., multicast groups and group communication
The literature on group membership is tangled up with the problem of ordered multicast (e.g., “CATOCS”).
• What are the ordering guarantees for message delivery, especially with respect to membership changes?
• Ordered group communication is controversial, but everyone needs a solution for the separate but related membership problem.
Failure Detectors
First problem: how to detect that a member has failed?
• pings, timeouts, beacons, heartbeats
• recovery notifications
“I was gone for awhile, but now I’m back.”
Is the failure detector accurate?
Is the failure detector live?
In an asynchronous system, it is possible for a failure detector to be accurate or live, but not both.
• As it turns out, it is impossible for an asynchronous system to agree on anything with accuracy and liveness!
• But this is academic...
Failure Detectors in Real Systems
Common solution:
• Use a failure detector that is live but not accurate.
Assume bounded processing delays and delivery times.
Timeout with multiple retries detects failure accurately with high probability.
If a “failed” site turns out to be alive, then kill it (fencing).
• Use a recovery detector that is accurate but not live.
“I’m back....hey, did anyone hear me?”
What do we assume about communication failures?
How much pinging is enough?
1-to-N, N-to-N, ring?
What about network partitions?
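A timeout-with-retries detector of the “live but not accurate” kind can be sketched as below. The `ping` callable is a hypothetical stand-in for whatever probe a real system uses (ICMP, RPC, heartbeat); the default timeout and retry count are arbitrary.

```python
from typing import Callable

def is_suspected(ping: Callable[[float], bool],
                 timeout: float = 0.5,
                 retries: int = 3) -> bool:
    """Return True if the peer is suspected to have failed.

    Live: the call always returns after at most `retries` probes.
    Not accurate: a slow or partitioned peer may be suspected while alive,
    which is why real systems pair this with fencing.
    """
    for _ in range(retries):
        if ping(timeout):      # ping returns True if a reply arrived in time
            return False
    return True

# Hypothetical probes for illustration.
def make_flaky(drops: int) -> Callable[[float], bool]:
    """A peer whose first `drops` replies are lost, then answers."""
    state = {"calls": 0}
    def ping(timeout: float) -> bool:
        state["calls"] += 1
        return state["calls"] > drops
    return ping

alive_but_lossy = is_suspected(make_flaky(2), retries=3)
wrongly_suspected = is_suspected(make_flaky(5), retries=3)
```

The second case shows the inaccuracy: the peer would have answered the sixth probe, but the detector gives up after three.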
Membership Service
Second problem: How to propagate knowledge of failure/recovery events to other nodes?
• Surviving nodes should agree on the new view (regrouping).
• Convergence should be rapid.
• The regrouping protocol should itself be tolerant of message drops, message reorderings, and failures.
liveness and accuracy again
• The regrouping protocol should be scalable.
• The protocol should handle network partitions.
• Behavior of the messaging system (e.g., group multicast) across membership changes must be well-specified and understood.
Example: Wombat
• Wombat is a new membership protocol, an outgrowth of Porcupine.
Gretta Bartels, University of Washington, Duke ‘98
• Wombat is empirically more efficient/scalable than competing algorithms such as Three Round.
• But: Wombat makes no guarantees about the relative ordering of membership events and messages.
Adherents of group communication would not accept it as a “real” membership protocol.
• Wombat’s assumptions have not been formally defined, and its properties have not been proven.
If you can’t prove that it works, you can’t believe that it works.
• Disclaimer: Wombat is a promising work in progress.
Wombat Basics
[Figure: a leader and its minions; arrows labeled “ping” connect neighboring nodes.]
Nodes are ranked by unique IDs.
Node IDs are permanent.
Node i pings predecessor(i).
The highest-ranked node is the leader.
All other nodes are minions.
The leader periodically broadcasts its view to all known minions.
physical broadcast
Minions adopt the leader’s view.
determine pred from leader’s view
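How a minion derives the leader and its ping target from the leader’s view can be sketched as below. Two assumptions are made here that the slides do not state: the smallest ID holds the highest rank, and the ping order wraps around so the leader pings the lowest-ranked node.

```python
from typing import List

def leader(view: List[int]) -> int:
    """Assumption: the smallest node ID is the highest rank."""
    return min(view)

def predecessor(node: int, view: List[int]) -> int:
    """The node that `node` pings: the next-higher-ranked member of the view.
    Assumption: the order wraps, so the leader pings the last node."""
    ring = sorted(view)
    return ring[ring.index(node) - 1]   # index -1 wraps to the end of the list

view = [7, 3, 12]            # as broadcast by the leader, in any order
who_leads = leader(view)
targets = {n: predecessor(n, view) for n in view}
```

Because every minion computes its predecessor from the same broadcast view, no extra messages are needed to repair the ping chain when the view changes.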
Node Arrival/Recovery in Wombat
If node i joins the cluster:
1. i waits for the leader’s next beacon.
2. i detects that the leader’s view does not include i.
3. i notifies the leader.
4. The leader updates its view.
5. The leader broadcasts its new view.
6. Minions adopt the leader’s view.
[Figure: new node i tells the leader “I’m here too.”]
Node Failure in Wombat
If a node fails:
1. Its successor notifies the leader.
2. The leader updates its view.
3. The leader broadcasts its view.
4. Minions adopt the leader’s new view.
5. Life goes on.
[Figure: node i fails (X); its successor tells the leader “Node i has failed.”]
Leader Failure in Wombat
If the leader fails:
1. Successor detects the failure.
2. Successor knows that the failed node was the leader.
3. Successor broadcasts as leader.
4. Minions adopt the new leader’s view.
5. Life goes on.
[Figure: the leader fails (X); its successor announces “I am in control.”]
Multiple Failures in Wombat
If the leader and its successor(s) fail(s), the next ranking node must assume command on its own.
1. Each node has a broadcast timer; if the timer goes off, broadcast as leader.
2. Each node’s timer is set by its rank.
if i < j then timer(i) < timer(j)
3. Reset timer on each beacon.
4. Leader’s timer value is adaptive.
Go faster if things are changing.
[Figure: the leader and its successor fail (X, X); the next ranking node concludes “I must be in control.”]
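The rank-ordered timers can be sketched as below. The linear formula and its constants are assumptions; the slide requires only that timer(i) < timer(j) whenever i < j. Rank 0 is taken to be the leader.

```python
from typing import Iterable, Set

def broadcast_timeout(rank: int, base: float = 1.0, step: float = 0.5) -> float:
    """Seconds of beacon silence before a node of this rank broadcasts as
    leader. Assumed linear form; any strictly increasing function would do."""
    return base + step * rank

def next_broadcaster(ranks: Iterable[int], failed: Set[int]) -> int:
    """With no beacons arriving, the surviving node whose timer fires first
    is the one that takes command."""
    survivors = [r for r in ranks if r not in failed]
    return min(survivors, key=broadcast_timeout)

ranks = range(5)                                      # rank 0 is the leader
new_leader = next_broadcaster(ranks, failed={0, 1})   # leader and successor die
```

Its beacon resets every lower-ranked node’s timer (step 3), so in the common case only one node ever broadcasts.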
Suppressing False Leaders
If a node falsely broadcasts as leader:
1. All nodes that know of a better leader recognize the usurper as such.
2. The real leader recognizes that it is a better leader than the usurper.
3. The real leader broadcasts the union of its view and the usurper’s view.
4. The usurper shuts up and adopts the real leader’s view.
What if the “real leader” is dead?
[Figure: a usurper announces “I must be in control.”; the real leader answers “I don’t think so.”]
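The reaction to a leader broadcast can be sketched as one handler run by any node that believes itself leader. The function name, the return convention, and the assumption that a smaller ID outranks a larger one are all hypothetical.

```python
from typing import Set, Tuple

def on_leader_broadcast(my_id: int, my_view: Set[int],
                        sender: int, sender_view: Set[int]) -> Tuple[str, Set[int]]:
    """Handle a broadcast from another self-declared leader.

    Returns (action, view): the better leader rebroadcasts the union of both
    views; the usurper goes quiet and adopts the view it just heard.
    """
    if my_id < sender:   # assumption: smaller ID means higher rank
        merged = my_view | sender_view | {sender}
        return ("rebroadcast", merged)
    return ("adopt", set(sender_view))

# Real leader 1 hears usurper 4, which only knows about nodes 4 and 5.
action, merged = on_leader_broadcast(1, {1, 2, 3}, sender=4, sender_view={4, 5})

# Usurper 4 then hears the real leader's corrected broadcast.
reaction, adopted = on_leader_broadcast(4, {4, 5}, sender=1, sender_view=merged)
```

Taking the union means no node known to either side is dropped from the view; members that are in fact dead are removed later by the normal failure path.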
Partitions in Wombat
If a network failure partitions the cluster:
1. The old partition continues.
2. The leader of the new partition eventually broadcasts its view.
3. Minions accept the new leader’s view.
[Figure: the cluster split into two partitions, each with its own partition leader.]
Healing a Partition
When the partition heals, either:
1. The dominating partition leader hears a false broadcast, and...
2. ...corrects it by broadcasting the union of the views.
- or -
1. The dominating partition leader broadcasts first, and...
2. ...minions respond “I’m here”.
[Figure: two partitions rejoining; one partition leader is the dominating leader.]
Wombat: Wrinkles
1. What are the assumptions about:
• network?
• clocks?
2. Are these reasonable/realistic assumptions?
3. How to ensure a single cluster view in the event of a partition?
4. How long does it take for the view to converge after a partition?
5. How do we start a cluster? What if a node starts or recovers but never receives a beacon?
6. What about the ordering of messages and membership events?
7. How do minions come to accept a new leader?
8. What about “message storms”?