Scaling the Single Universe
Jacky Mallett, Distributed Systems Architect, jacky@ccpgames.com


Page 1: Scaling the Single Universe Jacky Mallett, Distributed Systems Architect jacky@ccpgames.com

Scaling the Single Universe

Jacky Mallett, Distributed Systems Architect, jacky@ccpgames.com

Page 2:

Designing Single Universe games

• Known limits on scaling distributed systems.
• Why large systems are different from small ones.
• Is there a heart of darkness lurking at the core?
• Why there can’t be a single, optimal solution.

Page 3:

Why is a single Universe such a Challenge?

Page 4:

EVE Online

• 360,000 subscribers
• 63,170 concurrent users (Jan 23rd 2011)
  – single cluster (everyone in the same virtual world)
• Up to 1,800 users in the most popular solar system
• 1,000+ users in fleet fights
• Over 200 individual servers running solar system simulation and support
• 24 client access servers
• One database server

Page 5:

EVE Tranquility Cluster

• Players connect to Proxies.
• SOLs (solar systems) host the real-time game simulation.
• Dedicated game-mechanic nodes, e.g. markets, fleets, etc.

Page 6:

“Message” Based Architecture

• Players connect to Proxies.
• SOLs (solar systems) host the real-time game simulation.

Page 7:

Mesh vs. Embarrassingly Parallel

[Diagram: three independent servers, scaled by simple addition]

Scaling is relatively simple, as long as it’s a case of adding servers that have little or no need to communicate with each other…

Page 8:

Communication

[Diagram: two servers exchanging Svc.getCharInfo() requests and a Svc.getCharInfoResp() reply]

Distributed systems run on discrete units of computation, triggered by messages received on servers.
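A message-based server of this kind is essentially a dispatch table mapping message types to discrete units of computation. A minimal Python sketch, using the Svc.getCharInfo call from the diagram (the handler body and response shape are hypothetical, for illustration only):

```python
# Hypothetical sketch: one discrete computation per incoming message.
def get_char_info(char_id):
    # Placeholder computation; a real server would query its local state.
    return {"id": char_id, "name": f"char-{char_id}"}

# Dispatch table: message type -> unit of computation.
HANDLERS = {"Svc.getCharInfo": get_char_info}

def handle_message(msg_type, payload):
    # Each incoming message triggers exactly one handler and
    # produces one response message (type hardcoded for this sketch).
    handler = HANDLERS[msg_type]
    return ("Svc.getCharInfoResp", handler(payload))

resp_type, resp = handle_message("Svc.getCharInfo", 42)
```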

Page 9:

Incoming Server load from player updates

[Chart: Real-Time Players vs Server Load. X axis: number of players (0 to 25,000); Y axis: msg/s at server (0 to 800,000); one curve each for 1 msg/minute, 1 msg/second, 5 msg/second and 30 msg/second per player.]
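The curves in this chart are straight multiplication: aggregate server load is the number of players times the per-player message rate. A quick sketch reproducing the chart's values (the function name is ours, not from the deck):

```python
def server_load(players, msgs_per_player_per_s):
    """Aggregate incoming message rate at the server:
    players * per-player message rate."""
    return players * msgs_per_player_per_s

# The chart's four per-player rates, evaluated at the 25,000-player mark:
rates = {
    "1 msg/minute": 1 / 60,
    "1 msg/second": 1,
    "5 msg/second": 5,
    "30 msg/second": 30,
}
loads = {label: server_load(25_000, r) for label, r in rates.items()}
# 30 msg/second per player * 25,000 players = 750,000 msg/s at the server,
# matching the top of the chart's Y axis.
```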

Page 10:

Network Communication Limits

Information capacity is the instantaneous limit on the total amount of Shannon information a distributed system can handle.

Shannon information = a unique, single message, e.g. a single remote procedure call.

Page 11:

Where is the bottleneck?

SOLs: Load ≈ O(Proxies × Broadcast msg/s)
Proxies: Load ≈ O(SOLs × Broadcast msg/s)

It depends on the SOL : Proxy ratio. The number of broadcasts/s dominates.

[Diagram: one SOL broadcasting to several Proxies]

Page 12:

Broadcast Load

Page 13:

Network Communication Limits

Where L is an individual node’s link capacity with other nodes (group size) and N is the total number of nodes in the system:

• Strictly hierarchical topology scales as O(L_server)
• Full mesh scales as O(L√N) (Gupta & Kumar; Scaglione)
• Worst-case traffic scales as O(N²), as all the nodes try to talk to all the other nodes simultaneously.
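These scaling laws can be written down directly. A small Python sketch of the three expressions (the function names are ours; L and N are as defined above):

```python
import math

def hierarchical_capacity(L_server):
    # Strictly hierarchical: bounded by the root server's link capacity, O(L_server).
    return L_server

def mesh_capacity(L, N):
    # Full mesh aggregate information capacity, O(L * sqrt(N)).
    return L * math.sqrt(N)

def worst_case_traffic(N):
    # All nodes talking to all other nodes simultaneously: N(N-1), i.e. O(N^2).
    return N * (N - 1)

# With L = 40 (as in the model on the next slides), mesh capacity grows
# with N while the single server stays flat:
assert mesh_capacity(40, 10_000) > hierarchical_capacity(40)
```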

Page 14:

Model of Information Limits

[Chart: maximum load with increasing players. Mesh O(L√N) with L = 40 vs a single server with L = 40; the region of insufficient information capacity is accessible to mesh architectures, not to hierarchical ones.]

Page 15:

Close up on Left Hand Corner

The one place where everything is possible – the Laboratory

[Chart: close-up of the same curves at small N. Mesh O(L√N) with L = 40; single server with L = 40.]

Page 16:

Effect of increasing connectivity limit

[Chart: worst-case traffic N(N−1) plotted against mesh capacity L√N for L = 10, 50 and 100.]
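The crossover point where worst-case demand N(N−1) outgrows mesh capacity L√N can be located numerically. An illustrative sketch (this compares the two instantaneous expressions only, a toy model rather than a network simulation):

```python
import math

def max_supported_N(L):
    """Largest N for which mesh capacity L*sqrt(N) still covers the
    worst-case all-to-all demand N*(N-1). Found by simple search."""
    n = 2
    while L * math.sqrt(n) >= n * (n - 1):
        n += 1
    return n - 1

# Raising the per-node connectivity limit L pushes the crossover out,
# mirroring the three L√N curves on this slide:
sizes = {L: max_supported_N(L) for L in (10, 50, 100)}
```

Note how sub-linear the payoff is: a tenfold increase in L buys far less than a tenfold increase in the supportable all-to-all group.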

Page 17:

Conceptual Tools

Group size limit: Link capacity ≤ No. of clients (players)

Define group size as the maximum number of direct communicators supported by the available hardware.

Group size limit: L√N ≤ N(N−1)

Page 18:

Hierarchical vs Mesh Topologies

Hierarchical                                  | Mesh
Simple                                        | Complicated
Easy to provide a single point of control     | Hard to provide a single point of control
Synchronisation is what it does best          | Impossible to guarantee synchronisation
Fragile: single point of failure              | Robust: hard to bring down all nodes
Low information capacity                      | High information capacity
Cannot scale to large N                       | Scales to large N
Performs well with high-latency communication | Performs badly with high-latency communication

Page 19:

Latency influences scaling.

• Round-trip time affects message frequency.
• Message latency: the time to send and receive.
• Computation latency: how long a message takes to process.

But what this also means is that several different problems can have the same symptoms.

Page 20:

Using Hierarchy to Load balance

[Diagram: dedicated Character Nodes]

Character information was originally sourced by SOLs for their local players.

Character information is static, easy to index per character, and has a long-latency communication profile, so it was moved to its own nodes.

Page 21:

Advantages of Mesh Topology

We can divert load with long-latency profiles to other servers, e.g. Character Nodes, Markets, Fleet Management Nodes.

Page 22:

Using Mesh to Load Balance

[Diagram: Jita connected directly to several SOLs]

Topology change: Jita was moved from a hierarchical position to a mesh, thereby increasing information capacity.

Page 23:

Fischer Consensus Problem

Consensus, agreement on shared values, is a critical problem when it occurs in distributed systems.

“It is impossible to guarantee that a set of asynchronously connected processes can agree on even a single bit value.”

“Impossibility of distributed consensus with one faulty process” Fischer, Lynch, Paterson 1985

http://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf

“Consensus: The Big Misunderstanding”, Guerraoui, Schiper

http://www.cs.cornell.edu/Courses/cs734/2000fa/cached papers/gs97.pdf

Page 24:

Origin of the Consensus Problem

Fischer’s consensus problem showed that it is impossible to distinguish between a delayed message and a lost one.

Lost messages are equivalent to faulty processes.

“Guarantee”: most runs will achieve consensus.

But there is no possible protocol that can be guaranteed to converge.

The larger the distributed system, and the more messages/hops involved, the higher the probability of a failure or a delay.
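The core of the impossibility result can be seen in miniature: any finite timeout collapses "delayed" and "lost" into a single observation. A toy Python sketch (the function name and verdict strings are ours, purely illustrative):

```python
# Illustration only: from the receiver's side, a sufficiently delayed
# message is indistinguishable from a lost one within any finite timeout.
def receiver_verdict(arrival_time, timeout):
    """What the receiver concludes at its timeout deadline.
    arrival_time is None if the message was truly lost."""
    if arrival_time is not None and arrival_time <= timeout:
        return "delivered"
    # Delay past the deadline and outright loss collapse into one case.
    return "lost-or-delayed"

# A message delayed past the timeout and a truly lost message look identical:
assert receiver_verdict(arrival_time=5.0, timeout=2.0) == receiver_verdict(None, 2.0)
```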

Page 25:

Introduced by Message Handling

[Diagram: a message queue interleaving "Message from A", "Message from B" and messages from unrelated tasks]

• The reality of today’s computing environment is multiple layers of message queuing.
• As performance degrades, message delays increase, and the probability of a consensus problem increases.
• Very hard to debug in real-time systems, since debugging changes timing.

Page 26:

Symptoms of Consensus Problems

1. Sleep(1): reduces the probability of a delayed message.
2. Code complexity: locally analysing every possible ‘delayed/lost message’ instance, and writing code to handle it.
3. Regrettably this merely introduces more occurrences of the problem. In the limit the developer goes quietly crazy.

Note: even with theoretically reliable processes, guaranteeing consensus is non-trivial and requires multiple message rounds, which reduces available information capacity.

Page 27:

Solutions

• Consensus will usually occur, but can’t ever be guaranteed.
• The solutions are:
  – Error recovery: “The end user will recover”
  – Synchronization
• But we know that synchronizing at a single point will cause potential congestion.

Page 28:

Jumping typically involves several nodes

30s to resolve all cluster calls should be plenty…

Page 29:

Consensus Problems in Eve

• Number one cause of stuck petitions:
  – Technical Support “Stuck Queue”
  – Game Master support for the “the end user will recover” solution.
  – Software design can certainly minimize them, but beyond a certain point they have to be worked around.
  – Solutions to consensus all involve different kinds of performance trade-offs.

Page 30:

“Solutions” : Consensus, who needs it?

• It is possible to build systems that work with no consensus guarantees at all, e.g. Google, Facebook.
• Sensor network solutions have been proposed which exploit the oversupply of information relative to the problem space.
• Raise the probability of agreement to as close to 1 as possible, despite the presence of failure.
• “Probabilistic consensus algorithms”
• Generally speaking, these solutions can be made to work, but they typically rely on multiple rounds of message passing, which are expensive for large systems.
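The trade described here, where extra rounds buy a higher agreement probability at a message cost, can be sketched with a toy independence model (not a real protocol; the functions and the model are our own simplification):

```python
def agreement_probability(p_msg_ok, nodes, rounds):
    """Probability that every node hears the proposed value in at least
    one of `rounds` broadcast rounds, with independent per-message
    success probability p_msg_ok. A toy model, not a real protocol."""
    p_node_misses_all = (1 - p_msg_ok) ** rounds
    return (1 - p_node_misses_all) ** nodes

def message_cost(nodes, rounds):
    # Each round, the proposer re-broadcasts to every other node.
    return rounds * (nodes - 1)

# More rounds push agreement toward 1, at a linear message cost:
p1 = agreement_probability(0.99, nodes=200, rounds=1)
p3 = agreement_probability(0.99, nodes=200, rounds=3)
```

With 200 nodes and 99% per-message reliability, a single round leaves agreement far from certain, while three rounds push it very close to 1; the cost is that every extra round is another full broadcast.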

Page 31:

Eve Solution: Cache Early Cache Often

• All nodes cache data in EVE:
  – SOLs cache to protect the database from load.
  – Proxies cache to protect the SOLs.
  – Clients cache to protect the Proxies.
• Cached data is not synchronised, so at any given instant different nodes may have different views of some data.
• Programmers can control how out of date the view is allowed to be.
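One common way to give programmers that control is a time-to-live cache, where the TTL bounds how stale a node's view may be. A minimal sketch (the class and parameter names are ours, not CCP's code):

```python
import time

class TTLCache:
    """Minimal sketch: each tier trades freshness for load. The ttl
    parameter is how the programmer bounds how out of date the
    cached view may be."""
    def __init__(self, fetch, ttl_seconds):
        self.fetch = fetch            # fallback to the authoritative source
        self.ttl = ttl_seconds
        self.store = {}               # key -> (value, fetched_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self.store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]             # possibly stale, but no upstream load
        value = self.fetch(key)       # cache miss: one call to the next tier
        self.store[key] = (value, now)
        return value
```

For static data like character information, a long TTL shields the tier below at almost no cost to correctness; for fast-changing simulation state, the same trade is far less attractive.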

Page 32:

Solutions: Synchronisation

• Shared clocks: literal synchronization
  – Atomic clocks provide highly accurate synchronized time (Stratum 0).
  – The reliability of timing information provides extra local information that can be used to resolve consensus failure.
  – Practically, network-time-based synchronization has its own problems and works best with long-latency systems.
• Lamport timestamps
  – A simple algorithm that provides a partial ordering of events in a distributed system.
• Vector clocks
  – Both are ways of providing a logical ordering of events/messages in a distributed system.

“Time, Clocks, and the Ordering of Events in a Distributed System”, Leslie Lamport, 1978

http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html
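Lamport's algorithm from the cited paper is only a few lines: tick the local counter on every local event or send, and on receipt jump to max(local, remote) + 1. A sketch:

```python
class LamportClock:
    """Lamport's logical clock: tick on local events and sends; on
    receive, take max(local, remote) + 1. Yields a partial ordering
    of events across the system."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # Merge the sender's timestamp on message receipt.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()          # a sends a message stamped 1
t_recv = b.receive(t_send) # b's clock jumps past the sender's stamp
```

The receive rule guarantees that a send always orders before its receive, which is exactly the causal ordering the slide refers to; vector clocks extend the same idea with one counter per node.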

Page 33:

Solutions: Design

• Consensus requires synchronisation.
  – Avoid the requirement for consensus wherever possible.
  – Critical: will players notice or care about inconsistent state?
• If consensus is required, a single server is needed somewhere to handle it.
• Where possible, design consensus to be a long-latency process.
  – Minimises the probability of occurrence.

Page 34:

Pod Kills in Last Hour

Page 35:

EVE Fleet Fights

First, players converge on a node.

Sometimes they try to do this all at once.

Load is distributed as much as possible over other cluster machines.

Page 36:

Latency Analysis

• Typical per-player load on the cluster is < 0.5 msg/s.
• The 30s jump timer is a game mechanic that provides long latency.
• During a fleet fight this goes up to > 1 msg/s.
  – Dedicated nodes are used for known hot spots.
  – Load balancing over the evolution of the game.
• CPU load from fleet-fight calculations has high computational latency.
  – Highly optimized by CCP’s lag-fighting Gridlock team.
• Load on proxies is engineered with spare capacity for bursts.
  – Traffic measured on proxies is well distributed and stable.
• Non-real-time traffic is hosted on separate nodes.

Page 37:

EVE Fleet Fights

For large numbers of players, though, we’re always fighting the real-time limits.

Page 38:

Design Approach

• Calculate load as calls per second per player, and roughly estimate the number of players.
  – How much CPU is each call going to require to process?
  – How much network bandwidth?
• Is the design “embarrassingly parallel”, or can it be made so?
  – Does it have high latency?
  – Some form of hierarchical approach should work.
• If not, is the problem solvable at all?
  – A mesh approach should work.
  – Don’t try to solve the Fischer consensus problem!
• Can I change the problem and solve it at a higher level?
  – Fortunately, games give us enormous freedom to do this.

Page 39:

Takeaway

• Very large scale distributed computing is a design problem.
• Have a model for the requirements of your system.
  – Multiple issues have the same symptom.
• Limits apply at all levels of abstraction.
• The only systems that scale arbitrarily are the ones that don’t communicate with each other.
• Know the limits applicable for your application.
• Don’t try to guarantee consensus.

Page 40:

Any Questions?