Load balancing theory and practice
Transcript of Load balancing theory and practice
Welcome
Me:
• Dave Rosenthal
• Co-founder of FoundationDB
• Spent the last three years building a distributed transactional NoSQL database
• It’s my birthday
Any time you have multiple computers working on a job, you have a load balancing problem!
Warning
There is an ugly downside to learning about load balancing: TSA checkpoints, grocery store lines, and traffic lights may become even more frustrating.
What is load balancing?
Wikipedia: “…methodology to distribute workload across multiple computers … to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload”
All part of the latency curve
The latency curve

[Chart: latency vs. jobs/second (log scale), with the curve divided into Nominal, Interesting, Saturation, and Overload regions]
Goal for real-time systems

[Chart: the latency curve, highlighting the goal of low latency at a given load]
Goal for batch systems

[Chart: the latency curve, highlighting the goal of high jobs/second at a reasonable latency]
The latency curve

[Chart: latency (ms, log scale) vs. load from 0 to 1]
Better load balancing strategies can dramatically improve both latency and throughput
Load balancing tensions
• We want to shorten queues to get better latency
• We want to lengthen queues to keep a “buffer” of work on hand during irregular traffic, which gives better throughput
• For distributed systems, equalizing queue lengths across nodes sounds good
Can we just limit queue sizes?
[Chart: % of dropped jobs vs. queued job limit (0-20)]
Simple strategies
• Global job queue: for slow tasks
• Round robin: for highly uniform situations
• Random: probably won’t screw you
• Sticky: for cacheable situations
• Fastest of N tries: trades throughput for latency. I recommend N = 2 or 3.
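The “fastest of N tries” strategy can be sketched as request hedging: issue the same request to N servers and keep whichever answer comes back first. This is a minimal illustration, not the talk’s code; the server callables and thread-pool setup are my own assumptions.

```python
import concurrent.futures
import random

def fastest_of_n(servers, request, n=2):
    """Send the request to n randomly chosen servers and return the
    first response; the duplicated work buys lower tail latency."""
    chosen = random.sample(servers, n)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(server, request) for server in chosen]
        done, not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best effort; an already-running duplicate still finishes
        return next(iter(done)).result()
```

With N = 2 or 3 the extra load is modest, while a single slow server no longer dictates the response time.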
Use a global queue if possible
[Chart: latency under 80% load (log scale) vs. cluster size 1-10, comparing random assignment against a global job queue]
Options for information transfer
• None (rare)
• Latency (most common)
• Failure detection
• Explicit:
  – Load average
  – Queue length
  – Response times
FoundationDB’s approach
1. Request goes to a random one of three servers
2. The server either answers the query or replies “busy” if its queue is longer than the queue limit estimate
3. Queries that got “busy” are sent to a second random server with a “must do” flag set

Queue limit = 25 * 2^(20*P)
• A global queue limit is implicitly shared by estimating the fraction of incoming requests (P) that are flagged “must do”
• Converges to a P(redirect)/queue-size equilibrium
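The server side of this rule fits in a few lines. This is a minimal sketch of the mechanism described above; the object fields and function names are my own, not FoundationDB’s.

```python
def queue_limit(p_must_do):
    """Each server's estimate of the shared queue limit, from the
    observed fraction P of incoming requests flagged "must do":
    limit = 25 * 2^(20*P)."""
    return 25 * 2 ** (20 * p_must_do)

def handle(server, request):
    """Accept the request, or reply "busy" so the client retries a
    second server with the "must do" flag set.  `server` is a
    hypothetical object tracking its queue and its estimate of P."""
    limit = queue_limit(server.p_must_do)
    if request.must_do or len(server.queue) <= limit:
        server.queue.append(request)
        return "accepted"
    return "busy"
```

When servers get busier, more requests come back flagged “must do”, P rises, and every server’s limit estimate rises with it, which is what drives the system toward the equilibrium mentioned above.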
FDB latency curve before/after
[Two charts: latency (log scale) vs. operations per second (up to 1,200,000), before and after the change]
Tackling load balancing
• Queuing theory: one useful insight
• Simulation: do this
• Instrumentation: do this
• Control theory: know how to avoid this
• Operations research: read about this for fun
  – Blackett: shield planes where they are not shot!
The one insight: Little’s law
Q = R*W
• (Q)ueue size = (R)ate * (W)ait-time
• Q is the average number of jobs in the system
• R is the average arrival rate (jobs/second)
• W is the average wait time (seconds)
• Holds for any (!) steady-state system
  – Or sub-system, or joint system, or…
Little’s law example 1
Q = R*W
• We get 1,000,000 requests per second (R = 1E6)
• We take 100 ms to service each request
• Q = 1E6 * 0.100
• Little’s law: the average queue depth is 100,000!
Little’s law example 2
W = Q/R
• We have 100 users in the system making continuous requests (Q = 100)
• We get 10,000 requests per second (R = 1E4)
• W = 100 / 10,000
• Little’s law: the average wait time is 10 ms
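Both examples reduce to one line of arithmetic each:

```python
def queue_size(rate, wait):
    """Little's law: Q = R * W."""
    return rate * wait

def wait_time(queue, rate):
    """Rearranged: W = Q / R."""
    return queue / rate

# Example 1: 1,000,000 requests/second, 100 ms per request
q = queue_size(1e6, 0.100)      # ~100,000 jobs in flight on average
# Example 2: 100 concurrent users, 10,000 requests/second
w = wait_time(100, 10_000)      # 0.01 seconds (10 ms) average wait
```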
Little’s law ramifications
Q = R*W
• In a distributed system:
  – R scales up
  – W stays the same, or gets a bit worse
• To maintain performance, you’re going to need a whole lot of jobs in flight
The rest of queuing theory
Erlang:
• A language
• A man (Agner Krarup Erlang)
• And a unit! (Q from Little’s law, AKA offered load, is measured in dimensionless Erlang units)
• The Erlang-B formula (for limited-length queues)
• The Erlang-C formula (P(waiting))
Abandon hope
[Chart: real-world applicability vs. complexity of the math for queuing theory results, with Little’s law marked]
Simulation
The best way to explore distributed system behavior
Quiz
Model: jobs of random durations; 80% load.
Goal: minimize average job latency.

Which task should we work a bit more on?
• First task received
• Last task received
• Shortest task
• Longest task
• Random task
• Task with least work remaining
• Task with most work remaining
Simulation code snippets
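The talk’s snippets aren’t reproduced in this transcript. As a stand-in, here is a toy version of the quiz’s setup: a single server, exponential job sizes, Poisson arrivals at 80% load, and a pluggable policy for which queued job to run next. The model details are my assumptions, not the talk’s exact code.

```python
import random

def simulate(pick, n_jobs=20000, load=0.8, seed=1):
    """Toy single-server, non-preemptive simulation: job sizes are
    exponential with mean 1, arrivals are Poisson at rate `load`.
    `pick(queue)` returns the index of the queued job to run next.
    Returns mean latency (arrival to completion)."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(load)                  # inter-arrival gap
        arrivals.append((t, rng.expovariate(1.0)))  # (arrival time, size)
    queue, latencies, now, i = [], [], 0.0, 0
    while len(latencies) < n_jobs:
        while i < n_jobs and arrivals[i][0] <= now:  # admit arrivals
            queue.append(arrivals[i]); i += 1
        if not queue:                 # idle: jump ahead to next arrival
            now = arrivals[i][0]
            continue
        arrived, size = queue.pop(pick(queue))
        now += size                   # run the chosen job to completion
        latencies.append(now - arrived)
    return sum(latencies) / n_jobs

first_received = lambda queue: 0      # FIFO
shortest_task  = lambda queue: min(range(len(queue)), key=lambda j: queue[j][1])
```

In this toy model, picking the shortest queued task gives a noticeably lower mean latency than first-task-received, which is the flavor of result the next slide compares across all seven policies.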
Simulation results at 80% load
[Bar chart: average latency (0-50) by policy: task with most work remaining, least work remaining, random, longest, shortest, last received, first received]
Simulation results at 95% load
[Bar chart: average latency (log scale, 10-100,000) for the same policies at 95% load]
FoundationDB’s approach
• Strategy validated using simulation, then used for a single server’s fiber scheduling
• High priority: work on the next task to finish
• But be careful to enqueue incoming work from the network with highest priority—we want to know about all our jobs to make good decisions
• Low priority: catch up on housekeeping (e.g. non-log writing)
Load spikes
[Charts: queue behavior in a low-load vs. a high-load system]
Bursts of job requests can destroy latency. The effect is quadratic: A burst produces a queue of size B that lasts time proportional to B. On highly-loaded systems, the effect is multiplied by 1/(1-load), leading to huge latency impacts.
Burst-avoiding tip
1. Search for any delay/interval in your system
2. If system correctness depends on the delay/interval being exact, first fix that
3. Now change that delay/interval to randomly wait 0.8-1.2 times the nominal time on each execution
YMMV, but this tends to diffuse system events more evenly in time and help utilization and latency.
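The jitter in step 3 is one line. A sketch (the function name is my own):

```python
import random

def jittered(interval, rng=random.random):
    """Return the nominal interval scaled by a uniform factor in
    [0.8, 1.2], so that periodic work on many machines drifts out of
    phase over time instead of firing in synchronized bursts."""
    return interval * (0.8 + 0.4 * rng())

# e.g. in a polling loop, replace time.sleep(interval) with:
#     time.sleep(jittered(interval))
```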
Overload
[Chart: the latency curve, with the overload region highlighted]
Overload
What happens when work comes in too fast?
• Somewhere in your system a queue is going to get huge. Where?
• Efficiency drops due to:
  – Sloshing
  – Poor caching
• Unconditionally accepting new work means no information is transferred back to the upstream system!
Overload (cont’d): Sloshing
Loading 10 million rows into a popular NoSQL K/V store shows sloshing:

[Chart: throughput sloshing back and forth over 12.5 minutes]
Overload (cont’d): No sloshing
Loading 10 million rows into FDB shows smooth behavior:
System queuing

[Diagram: incoming jobs A-E feeding three nodes, each with its own queue]
System queuing

[Diagram: job A dispatched to Node 1’s queue; B-E still incoming]
Internal queue buildup

[Diagram: jobs A-D piled up in Node 1’s queue while Nodes 2 and 3 sit empty]
Even queues, external buildup

[Diagram: one job per node (C, B, A), with D, E, … waiting in an external queue in front of the system]
Our approach
“Ratekeeper”
• Active management of internal queue sizes prevents sloshing
• Avoids every subcomponent needing its own well-tuned load balancing strategy
• Queue information is explicitly sent at 10 Hz back to a centrally-elected control algorithm
• When queues get large, slow system input
• Pushes latency into an external queue at the front of the system using “tickets”
Ratekeeper in action
[Chart: operations per second (0-1,400,000) over 600 seconds with Ratekeeper active]
Ratekeeper internals
What can go wrong
Well, we are controlling the queue depths of the system, so, basically, everything in control theory…
Namely, oscillation:
Recognizing oscillation
• Something moving up and down :)
  – Look for low utilization of parallel resources
  – Zoom in!
• Think about sources of feedback—is there some way that a machine getting more work done feeds either less or more work to that machine in the future? (Probably yes.)
What oscillation looks like
[Chart: utilization % (0-70) of Node A and Node B over seconds 1-5]
What oscillation looks like
[Chart: the same data zoomed in to seconds 2.0-2.3, showing rapid oscillation between Node A and Node B]
Avoiding oscillation
• This is control theory—avoid it if possible!
• The major thing to know: control gets harder as frequencies get higher (e.g. Bose headphones)
• Two strategies:
  – Control on a longer time scale
  – Introduce a low-pass filter in the control loop (e.g. an exponential moving average)
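An exponential moving average is a few lines; this sketch (class and parameter names are my own) shows the low-pass idea:

```python
class LowPass:
    """Exponential moving average: a tiny low-pass filter for a
    control loop's input signal.  Smaller alpha = heavier smoothing,
    i.e. control on a longer effective time scale."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None
    def update(self, sample):
        if self.value is None:
            self.value = sample            # seed with the first sample
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value
```

Feeding a fast square wave through it yields a nearly flat line: the high-frequency component the controller would otherwise chase is filtered out.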
Instrumentation
If you can’t measure it, you can’t make it better
Things that might be nice to measure:
• Latencies
• Queue lengths
• Causes of latency?
Measuring latencies
Our approach:
• We want information about the distribution, not just the average
• We use a “Distribution” class
  – addSample(X)
  – Stores 500+ samples
  – Throws away half of them when it hits 1000 samples, and halves the probability of accepting new samples
  – Also tracks exact min, max, mean, and stddev
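A sketch of such a class (names and details are my reconstruction from the description above, not FoundationDB’s code):

```python
import math
import random

class Distribution:
    """Keeps a bounded random sample of observations plus exact
    summary stats.  When the sample hits `limit`, discard half at
    random and halve the probability of keeping new samples, so the
    retained points stay a roughly uniform sample of everything seen."""
    def __init__(self, limit=1000):
        self.limit = limit
        self.keep_prob = 1.0
        self.samples = []
        self.n = 0
        self.min = float("inf")
        self.max = float("-inf")
        self.total = 0.0
        self.total_sq = 0.0
    def add_sample(self, x):
        # exact stats are always updated
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        self.total += x
        self.total_sq += x * x
        # the sample set is updated probabilistically
        if random.random() < self.keep_prob:
            self.samples.append(x)
            if len(self.samples) >= self.limit:
                self.samples = random.sample(self.samples, self.limit // 2)
                self.keep_prob /= 2
    @property
    def mean(self):
        return self.total / self.n
    @property
    def stddev(self):
        return math.sqrt(max(0.0, self.total_sq / self.n - self.mean ** 2))
```

The bounded sample set is what lets you ask distribution questions (medians, percentiles) after the fact, while min/max/mean/stddev stay exact.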
Measuring queue lengths
Our approach:
• Track the % of time that a queue is at zero length
• Measure queue length snapshots at intervals
• Watch out for oscillations
  – Slow ones you can see
  – Fast ones look like noise (which, unfortunately, is also what noise looks like)
  – “Zoom in” to exclude the possibility of micro-oscillations
Measuring latency from blocking
• Easy to calculate:
  – L = (b0² + b1² + … + bN²) / elapsed
  – Total all squared blocking times over some interval, then divide by the duration of the interval
• Measures the impact of unavailability on mean latency for random traffic
• Example: is this server’s high latency explained by this lock?
• Doesn’t count catch-up time
Summary
Thanks for listening, and remember:
• Everything has a latency curve
• Little’s law
• Randomize regular intervals
• Validate designs with simulation
• Instrument
May your queues be small, but not empty
Prioritization/QOS
• Can help in systems under partial load
• Vital in systems that handle batch and real-time loads simultaneously
• Be careful that high-priority work doesn’t generate further high-priority work while other jobs sit in the queue. This can lead to poor utilization, analogous to the internal queue buildup case.
Congestion pricing
• My favorite topic
• Priority isn’t just a function of the benefit of your job
• To be a good citizen, you should subtract the costs to others
• For example, jumping to the front of a long queue has costs proportional to the queue size
Other FIFO alternatives?
• LIFO
  – Removes the incentive to line up early
  – In situations where there is adequate capacity to serve everyone, can yield better waiting times for everyone involved