Scaling the Single Universe
Jacky Mallett, Distributed Systems [email protected]
Designing Single Universe games
• Known limits on scaling distributed systems.
• Why large systems are different from small ones.
• Is there a heart of darkness lurking at the core?
• Why there can’t be a single, optimal solution.
Why is a single Universe such a Challenge?
EVE Online
• 360,000 subscribers
• 63,170 concurrent users (Jan 23rd 2011)
  – single cluster (everyone in the same virtual world)
• Up to 1,800 users in most popular solar system
• 1,000+ users in fleet fights
• Over 200 individual servers running solar system simulation and support
• 24 client access servers
• One database server
EVE Tranquility Cluster

Players connect to Proxies
SOLS (solar systems) host the real time game simulation
Dedicated game mechanic nodes. E.g. markets, fleets, etc.
“Message” Based Architecture
Players connect to Proxies
SOLS (solar systems) host the real time game simulation
Mesh vs. Embarrassingly Parallel
[Diagram: independent servers with no inter-server links]
Scaling is relatively simple, as long as it’s a case of adding servers that have little or no need to communicate with each other…
Communication
[Diagram: servers exchanging Svc.getCharInfo() calls and a Svc.getCharInfoResp() response]
Distributed Systems run on discrete units of computation, triggered by messages received on servers.
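As a sketch of that model (hypothetical names, not CCP’s code), a message-driven server reduces to an inbox queue and a dispatch table:

```python
import queue

class Server:
    """Toy message-driven server: each message received triggers one
    discrete unit of computation (a handler). Names are illustrative."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.handlers = {}          # message type -> handler function
        self.processed = 0

    def register(self, msg_type, handler):
        self.handlers[msg_type] = handler

    def send(self, msg_type, payload):
        self.inbox.put((msg_type, payload))

    def run_once(self):
        """Process a single queued message, if any."""
        try:
            msg_type, payload = self.inbox.get_nowait()
        except queue.Empty:
            return None             # idle until a message arrives
        self.processed += 1
        return self.handlers[msg_type](payload)

server = Server()
server.register("getCharInfo", lambda char_id: {"id": char_id, "name": "pilot"})
server.send("getCharInfo", 42)
resp = server.run_once()            # one message in, one unit of work out
```

Nothing happens between messages: the server’s load is entirely a function of the message arrival rate, which is why the following charts are drawn in msg/s.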
Incoming Server load from player updates

[Chart: Real Time Players vs Server Load – msg/s at the server (0–800,000) against number of players (0–25,000), for per-player update rates of 1 msg/minute, 1 msg/second, 5 msg/second and 30 msg/second]
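The chart’s load lines are simply players × per-player message rate; a quick back-of-the-envelope check, using the rates from the chart legend:

```python
def server_load(players, msgs_per_player_per_s):
    """Aggregate incoming message rate at the cluster, in msg/s."""
    return players * msgs_per_player_per_s

# Rates from the chart legend, at the chart's maximum of 25,000 players
rates = {"1/min": 1 / 60, "1/s": 1.0, "5/s": 5.0, "30/s": 30.0}
loads = {label: server_load(25_000, rate) for label, rate in rates.items()}
```

At 30 msg/s per player the cluster faces 750,000 msg/s; at 1 msg/minute only about 417 msg/s. This is why real-time update rates have to be kept low.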
Network Communication Limits
Information Capacity is the instantaneous limit on the total amount of Shannon Information a distributed system can handle
Shannon Information = a unique, single message, e.g. a single remote procedure call
Where is the bottleneck?
SOL
SOLs: Load ≈ O(Proxies × Broadcast msg/s)
Proxies: Load ≈ O(SOLs × Broadcast msg/s)

It depends on the SOL : Proxy ratio. The number of broadcasts/s dominates.

[Diagram: broadcast load fanning out between Proxy and SOL nodes]
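Plugging in the cluster shape from the earlier slide (roughly 200 SOLs, 24 proxies) and a hypothetical 100 broadcasts/s shows how the two sides of the ratio behave:

```python
def sol_load(n_proxies, broadcasts_per_s):
    # A broadcasting SOL must deliver each broadcast to every proxy
    return n_proxies * broadcasts_per_s

def proxy_load(n_sols, broadcasts_per_s):
    # Worst case: a proxy receives the broadcasts of every SOL at once
    return n_sols * broadcasts_per_s

per_sol = sol_load(24, 100)        # msg/s leaving one busy SOL
per_proxy = proxy_load(200, 100)   # msg/s arriving at one proxy, worst case
```

With many more SOLs than proxies, the proxies bear the larger broadcast burden, which is the bottleneck the slide is pointing at.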
Network Communication Limits
Where L is individual node link capacity with other nodes (group size) and N is total number of nodes in system :
Strictly Hierarchical Topology scales O(L) – limited by the single server’s link capacity
Full Mesh scales O(L√N) (Gupta & Kumar; Scaglione)
Traffic worst case scales O(N²) as all the nodes try to talk to all the other nodes simultaneously.
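These three scaling laws can be compared directly; a minimal sketch, with L as the per-node link capacity:

```python
import math

def hierarchical_capacity(L):
    """Strictly hierarchical: throughput is capped by one server's links."""
    return L

def mesh_capacity(L, N):
    """Full mesh aggregate capacity, O(L * sqrt(N)) (after Gupta & Kumar)."""
    return L * math.sqrt(N)

def worst_case_traffic(N):
    """All-pairs demand when every node talks to every other node: N(N-1)."""
    return N * (N - 1)
```

For N = 100 nodes with L = 40, the mesh carries ten times what a single server can, but the worst-case all-pairs demand grows far faster than either.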
Model of Information Limits
[Chart: load regions – insufficient Information Capacity, and a region accessible to mesh architectures but not to hierarchical ones]
Maximum Load with increasing players
[Chart: Mesh O(L√N) with L = 40 vs Single Server with L = 40]
Close up on Left Hand Corner
The one place where everything is possible – the Laboratory
Maximum Load with increasing players
[Chart: close-up of the low-player corner – Mesh O(L√N) with L = 40 vs Single Server with L = 40]
Effect of increasing connectivity limit
[Chart: worst-case traffic N(N−1) against mesh capacity L√N for L = 10, 50 and 100]
Conceptual Tools
Group Size Limit: Link Capacity ≤ No. of Clients (Players)

Define Group Size as the maximum number of direct communicators supported by available hardware.

Group Size Limit: L√N ≤ N(N−1)
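One toy reading of that inequality (illustrative only): find the largest group whose all-pairs traffic a mesh of capacity L√N can still carry.

```python
import math

def within_group_limit(L, N):
    """True while mesh capacity L*sqrt(N) still covers all-pairs demand N(N-1)."""
    return L * math.sqrt(N) >= N * (N - 1)

def max_group_size(L):
    """Largest N whose worst-case traffic the mesh can still absorb."""
    N = 1
    while within_group_limit(L, N + 1):
        N += 1
    return N
```

With L = 40 the limit binds at only a dozen fully-intercommunicating nodes, showing how quickly N(N−1) demand outruns L√N capacity.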
Hierarchical vs Mesh Topologies
Hierarchical                                     Mesh
Simple                                           Complicated
Easy to provide single point of control          Hard to provide single point of control
Synchronisation is what it does best             Impossible to guarantee synchronisation
Fragile – single point of failure                Robust – hard to bring down all nodes
Low Information Capacity                         High Information Capacity
Cannot scale to large N                          Scales to large N
Performs well with high latency communication    Performs badly with high latency communication
Latency influences scaling.
Round Trip Time affects Message Frequency
Message Latency – time to send and receive
Computation latency – how long it takes to process.
But what this also means is that several different problems can have the same symptoms.
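For a synchronous request/response exchange, the round trip plus processing time bounds how often a client can message; a sketch with illustrative numbers:

```python
def max_sync_msg_rate(rtt_s, compute_s):
    """Upper bound on synchronous call rate over one connection:
    each call occupies the exchange for its round trip time plus
    the time the server takes to process it."""
    return 1.0 / (rtt_s + compute_s)

# 100 ms round trip + 20 ms server-side computation (illustrative)
rate = max_sync_msg_rate(0.100, 0.020)   # roughly 8.3 msg/s, not 10
```

The same symptom – a low message rate – can come from either the network term or the computation term, which is exactly why different problems present identically.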
Using Hierarchy to Load balance
[Diagram: dedicated Character Nodes alongside the SOLs]
Character Information was sourced by SOLS for their local players.
Character Information is static, easy to index per character, and has a long latency communication profile → moved to its own nodes.
Advantages of Mesh Topology
We can divert load with long latency profiles to other servers, e.g. Character Nodes, Markets, Fleet Management Nodes.
Using Mesh to Load Balance
[Diagram: Jita connected directly to several other SOL nodes]

Topology change – moved Jita from a hierarchical position to a mesh, thereby increasing Information capacity.
Fischer Consensus Problem

Consensus – agreement on shared values – is a critical problem when it occurs in distributed systems.
“It is impossible to guarantee that a set of asynchronously connected processes can agree on even a single bit value.”
“Impossibility of distributed consensus with one faulty process” Fischer, Lynch, Paterson 1985
http://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf
“Consensus: The Big Misunderstanding”, Guerraoui, Schiper
http://www.cs.cornell.edu/Courses/cs734/2000fa/cached papers/gs97.pdf
Origin of the Consensus Problem
Fischer’s Consensus problem showed that it is impossible to distinguish between a delayed message and a lost one
Lost messages are equivalent to faulty processes
“Guarantee” - Most runs will achieve consensus.
But there is no possible protocol that can be guaranteed to converge
The larger the distributed system – and the more messages/hops involved the higher the probability of a failure or a delay.
Introduced by Message Handling
[Diagram: a message queue interleaving a message from A, a message from B, and messages from unrelated tasks]
• Reality of today’s computing environment is multiple layers of message queuing
• As performance degrades, message delays increase, and the probability of a consensus problem increases
• Very hard to debug in real time systems, since debugging changes timing.
Symptoms of Consensus Problems
1. Sleep(1) – reduces the probability of a delayed message
2. Code complexity – locally analysing every possible ‘delayed/lost message’ instance, and writing code to handle it
3. Regrettably this merely introduces more occurrences of the problem. In the limit the developer goes quietly crazy.
Note: Even with theoretically reliable processes, guaranteeing consensus is non-trivial and requires multiple message rounds
Which reduces available Information capacity
Solutions
• Consensus will usually occur, but can’t ever be guaranteed
• The solutions are:
  – Error recovery: “The end user will recover”
  – Synchronization
• But, we know that synchronizing at a single point will cause potential congestion.
Jumping typically involves several nodes
30s to resolve all cluster calls should be plenty…
Consensus Problems in Eve
• Number one cause of stuck petitions
  – Technical Support “Stuck Queue”
  – Game Master support for the “the end user will recover” solution.
  – Software design can certainly minimize them, but beyond a certain point they have to be worked around.
  – Solutions to consensus all involve different kinds of performance trade-offs.
“Solutions” : Consensus, who needs it?
• It is possible to build systems that work with no consensus guarantees at all
  – E.g. Google, Facebook.
• Sensor network solutions have been proposed which exploit the over-supply of information relative to the problem space
• Raise the probability of agreement to as close to 1 as possible, despite the presence of failure
• “Probabilistic consensus algorithms”
• Generally speaking, these solutions can be made to work, but they typically rely on multiple rounds of message passing that are expensive for large systems
Eve Solution: Cache Early Cache Often
• All nodes cache data in EVE
  – SOLS cache to protect the Database from load
  – Proxies cache to protect the SOLS
  – Clients cache to protect the Proxies
• Cached data is not synchronised, so at any given instant different nodes may have different views of some data
• Programmers can control how relatively out of date the view is
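A minimal sketch of the idea (not CCP’s code): a time-to-live cache, where the TTL is the programmer’s knob for how out of date a view may be. The clock is injectable so the example is deterministic.

```python
import time

class TTLCache:
    """Serve a cached value until its time-to-live expires, then refetch."""

    def __init__(self, ttl_seconds, fetch, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.fetch = fetch              # called on a miss or an expired entry
        self.clock = clock
        self._store = {}                # key -> (value, expiry time)
        self.misses = 0

    def get(self, key):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]             # still fresh: a possibly stale view
        self.misses += 1
        value = self.fetch(key)
        self._store[key] = (value, now + self.ttl)
        return value

t = [0.0]                               # fake clock, advanced by hand
cache = TTLCache(5.0, fetch=lambda key: f"db:{key}", clock=lambda: t[0])
cache.get("char42")                     # miss: hits the backing store
t[0] = 3.0
cache.get("char42")                     # within TTL: served from cache
t[0] = 6.0
cache.get("char42")                     # expired: refetched
```

Each layer of the cluster (client, proxy, SOL) can run the same pattern with a TTL chosen for how stale that layer is allowed to be.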
Solutions: Synchronisation

• Shared clocks – literal synchronization
  – Atomic clocks provide highly accurate synchronized time (Stratum 0)
  – The reliability of timing information provides extra local information that can be used to resolve consensus failure.
  – Practically, network time based synchronization has its own problems and works best with long latency systems.
• Lamport Timestamps
  – Simple algorithm that provides a partial ordering of events in a distributed system
• Vector clocks
  – Both are ways of providing a logical ordering of events/messages in a distributed system
"Time, clocks, and the ordering of events in a distributed system”Leslie Lamport, 1978
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html
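A sketch of Lamport’s algorithm from the paper above: tick on each local event, and on receipt jump strictly past the sender’s timestamp, yielding a partial order of events.

```python
class LamportClock:
    """Lamport timestamps: a logical clock giving a partial ordering of
    events without any shared physical time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event (including sending a message)."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """On receipt, move strictly past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # a sends a message stamped 1
b.tick()                     # b has an unrelated local event
t_recv = b.receive(t_send)   # b jumps to 2: the receive orders after the send
```

The ordering is only partial: concurrent events on different nodes may carry equal timestamps, which is what vector clocks refine.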
Solutions: Design
• Consensus requires synchronisation
  – Avoid the requirement for consensus wherever possible
  – Critical – will players notice or care about inconsistent state?
• If consensus is required – a single server is needed somewhere to handle it
• Where possible design consensus to be a long latency process
  – Minimises probability of occurrence.
Pod Kills in Last Hour
EVE Fleet Fights
First players converge on a node
Sometimes they try to do this all at once.
Load is distributed as much as possible over other cluster machines.
Latency Analysis
• Typical per player load on the cluster is < 0.5 msg/s
• 30s Jump Timer is a game mechanic that provides long latency
• During a fleet fight this goes up to > 1 msg/s
  – Dedicated nodes used for known hot spots
  – Load balancing over the evolution of the game
• CPU load from fleet fight calculations has high computational latency
  – Highly optimized by CCP’s lag-fighting Gridlock team.
• Load on proxies is engineered with spare capacity for bursts
  – Traffic measured on proxies is well distributed and stable
• Non-real time traffic is hosted on separate nodes
EVE Fleet Fights
For large numbers of players though, we’re always fighting the real time limits.
Design Approach
• Calculate load as calls per second per player and roughly estimate the number of players
  – How much CPU is each call going to require to process?
  – How much network bandwidth?
• Is the design “embarrassingly parallel”, or can it be made so?
  – Does it have high latency?
  – Some form of hierarchical approach should work
• If not – is the problem solvable at all?
  – Mesh approach should work
  – Don’t try to solve the Fischer consensus problem!
• Can I change the problem and solve it at a higher level?
  – Fortunately games give us enormous freedom to do this.
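The first bullet can be mechanised as a first-pass feasibility check; the helper and all the numbers below are illustrative, not EVE’s real figures:

```python
def cluster_budget(players, calls_per_player_s, cpu_s_per_call,
                   bytes_per_call, cores, bandwidth_bytes_s=1_000_000_000):
    """Rough feasibility check: does the estimated call rate fit the
    CPU and network budget? (Hypothetical helper, illustrative defaults.)"""
    call_rate = players * calls_per_player_s
    cpu_needed = call_rate * cpu_s_per_call      # CPU-seconds needed per second
    net_needed = call_rate * bytes_per_call      # bytes per second
    return {
        "calls_per_s": call_rate,
        "cpu_ok": cpu_needed <= cores,           # one CPU-second per core per second
        "net_ok": net_needed <= bandwidth_bytes_s,
    }

# 25,000 players at the typical 0.5 msg/s, 1 ms CPU and 500 bytes per call
estimate = cluster_budget(25_000, 0.5, 0.001, 500, cores=16)
```

Doubling the CPU cost per call (e.g. a fleet-fight calculation) blows the same 16-core budget, which is the kind of answer this back-of-the-envelope step is meant to surface before any code is written.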
Takeaway.
• Very large scale distributed computing is a design problem.
• Have a model for the requirements of your system
  – Multiple issues have the same symptom
• Limits apply at all levels of abstraction
• The only systems that scale arbitrarily are the ones that don’t communicate with each other.
• Know the limits applicable for your application.
• Don’t try to guarantee consensus.
Any Questions?