Post on 25-May-2015
Boxcar: A self-balancing distributed services protocol
R.B. Boyer, Software Engineer
Resume
I help people get jobs.
I solve interesting problems.
Boxcar was the solution to a problem:
Building
How we build products
Simple
Fast
Comprehensive
Relevant
How we build systems
Simple
Fast
Resilient
Scalable
“I want my application to be more complicated”
- No one ever
Simple
Complexity creates confusion
Confusion breeds bugs
“I want my application to be slower”
- No one ever
Fast
conducted a speed test
+500 milliseconds of latency per search
20% fewer searches
Speed is a feature
“I want my users to experience outages”
- No one ever
Resilient
Programs crash
Machines die
Minimize vulnerability to any failure
“My system will only need to support 10 users”
- No one ever
Scalable
Scale with MORE machines
Not BIGGER machines
TL;DR:
[Diagram: Job Sites → Indeed (Aggregation, Job Search) → Job Seekers]
Challenge!
Challenge: Simple, Fast, Resilient, Scalable
keep this
Options:
Share data access?
Example: Shared Database
[Diagram, built up one consumer at a time: the Main Database becomes a Shared Database used directly by the Main Application, Analysis Tool, Billing Application, Intern Project, Other Intern Project, and Email Tool]
This is an anti-pattern
On a long enough timeline...
Maintenance Nightmare
Share data access
Insulate data from consumers
[Diagram: the Shared Database becomes an Insulated Database. The Main Application, Analysis Tool, Billing Application, Intern Project, Other Intern Project, and Email Tool reach the Main Database only through the Main Service]
Service?
Service
[Diagram: four Clients call the Service over the NETWORK. The Service keeps the Icky Technical Details (Caches, Databases, Logging, Business Logic, ...) behind a Client API]
Client API
Service.getJobs([12345, 62])
Icky Technical Details
SELECT *
FROM jobs AS j
LEFT JOIN companyinfo AS ci ON j.id = ci.job_id
LEFT JOIN locations AS loc ON loc.id = j.location_id
WHERE j.id IN (12345, 62)
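The contrast between the Client API call and the query behind it can be sketched as a small facade. This is only an illustration: the table names come from the slide, while the `JobService` class, the `title`, `company`, and `city` columns, the demo rows, and the use of an in-memory SQLite database are all assumptions made so the example runs on its own.

```python
import sqlite3
from typing import Dict, List


class JobService:
    """Hypothetical facade: callers ask for jobs by id and never see the SQL."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get_jobs(self, job_ids: List[int]) -> List[Dict]:
        if not job_ids:
            return []
        placeholders = ", ".join("?" for _ in job_ids)
        sql = (
            "SELECT j.id, j.title, ci.company, loc.city "
            "FROM jobs AS j "
            "LEFT JOIN companyinfo AS ci ON j.id = ci.job_id "
            "LEFT JOIN locations AS loc ON loc.id = j.location_id "
            f"WHERE j.id IN ({placeholders}) ORDER BY j.id"
        )
        rows = self._conn.execute(sql, job_ids).fetchall()
        keys = ("id", "title", "company", "city")
        return [dict(zip(keys, row)) for row in rows]


def make_demo_db() -> sqlite3.Connection:
    # Made-up schema and rows, just enough to exercise the joins.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE locations (id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, location_id INTEGER);
        CREATE TABLE companyinfo (job_id INTEGER, company TEXT);
        INSERT INTO locations VALUES (1, 'Austin');
        INSERT INTO jobs VALUES (62, 'Engineer', 1);
        INSERT INTO jobs VALUES (12345, 'Analyst', NULL);
        INSERT INTO companyinfo VALUES (62, 'Example Co');
        """
    )
    return conn


service = JobService(make_demo_db())
jobs = service.get_jobs([12345, 62])  # the caller's whole view of the system
```

If the schema changes, only `get_jobs` has to change; the callers keep passing ids.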
Service Oriented Architecture
Boxcar
Boxcar is a... self-balancing distributed services protocol
Origin Story
There was a life before Boxcar
There were services before Boxcar
Pick one:
Doc Service
Document Serving Service, aka “Doc Service”
http://go.indeed.com/docservice
Doc Service controls access to JOBS
Building Blocks
Webapp: Wants jobs
Doc Service: Controls access to jobs
Docstore: Stores jobs
Build it
[Diagram, built up in steps: Webapp → Doc Service → Docstore]
Mission Accomplished
But is it good?
How we build systems
Simple
Fast
Resilient
Scalable
Goodness Metric
Simple deploys
Efficient networking (Fast)
Resilient
Horizontally scalable
[Diagram: a single Webapp → Doc Service → Docstore stack]
Resilient ✘
Add Resilience
[Diagram: a Front-end Load Balancer in front of two identical stacks, each Webapp → Doc Service → Docstore]
Siloed Stacks
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ✓
Horizontally scalable ?
Scaling Silos
[Diagram: scaling the siloed stacks behind the Front-end Load Balancer]
Need bigger and bigger machines
Vertical Scaling
Siloed Stacks (Services Version 1)
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ✓
Horizontally scalable ✘
Improve scalability
[Diagram: the silos are split apart. Three Webapps sit behind the Front-end Load Balancer and reach two Doc Service → Docstore pairs through a Doc Service Load Balancer]
Per-Service Balancer
Simple deploys ~
Efficient networking (Fast) ?
Resilient ?
Horizontally scalable ✓
2x Bandwidth ✘
[Diagram: Webapp → Doc Service Load Balancer → Doc Service]
Proxying isn’t free
Per-Service Balancer
Simple deploys ~
Efficient networking (Fast) ✘
Resilient ?
Horizontally scalable ✓
Resilience
[Diagram: the Doc Service Load Balancer between the Webapps and the Doc Services is a SINGLE POINT OF FAILURE]
Need two balancers
...and a way to balance between them?
Load Balancer Balancing
Master / Slave
Share IP address
Heartbeat between nodes
Complex
Resilience
[Diagram: two Doc Service Load Balancers now sit between the three Webapps and the two Doc Service → Docstore pairs]
Best explained by our Operations folks:
“Redundant Array of Inexpensive Datacenters”
http://go.indeed.com/raid
Per-Service Balancer (Services Version 2)
Simple deploys ~
Efficient networking (Fast) ✘
Resilient ✓
Horizontally scalable ✓
Reduce network waste
[Diagram: the Doc Service Load Balancer is removed, and the three Webapps connect to the two Doc Services directly]
Naive Round Robin
Simple deploys ✓
Efficient networking (Fast) ?
Resilient ?
Horizontally scalable ✓
Direct Connections
1x Bandwidth ✓
[Diagram: Webapp → Doc Service]
Naive Round Robin
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ?
Horizontally scalable ✓
[Animation: a REQUEST sent to one of Server A and Server B fails (✘) and is sent again to the other server]
Naive Round Robin
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ✓
Horizontally scalable ✓
Balanced ?
[Animation: requests alternate between a Slow server and a Fast server, and pile up on the Slow server]
Can’t keep up
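The pile-up can be reproduced with a toy tick-based simulation. It is only a sketch: the function name, the one-request-per-tick arrival rate, and the service times (4 ticks for the slow server, 1 tick for the fast one) are made-up values chosen for illustration.

```python
from itertools import cycle


def round_robin_backlog(ticks, service_time):
    """Return the queue length per server after `ticks` ticks of naive round robin.

    service_time maps a server name to the ticks it needs per request.
    """
    queue = {name: 0 for name in service_time}
    progress = {name: 0 for name in service_time}
    targets = cycle(service_time)            # naive rotation, ignores load
    for _ in range(ticks):
        queue[next(targets)] += 1            # one request arrives per tick
        for name, cost in service_time.items():
            if queue[name] == 0:
                continue                     # nothing to work on
            progress[name] += 1
            if progress[name] == cost:       # current request finished
                queue[name] -= 1
                progress[name] = 0
    return queue


backlog = round_robin_backlog(40, {"slow": 4, "fast": 1})
```

With these numbers the slow server receives a request every 2 ticks but finishes one every 4, so its queue keeps growing while the fast server sits idle half the time.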
Naive Round Robin
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ✓
Horizontally scalable ✓
Balanced ✘
NOPE
Ensure balance
[Diagram: the Doc Service Load Balancer is removed from between the three Webapps and the two Doc Services]
Distribute!
[Diagram: each of the three Webapps carries its own balancer (B) and connects to both Doc Service → Docstore pairs]
Boxcar
Naive Round Robin
Per-Service Balancer
The Boxcar balancing algorithm is simple
Servers assign numeric value to connections
Clients use the connection with the lowest numeric value to service each request
Gist
Server A: Slot 0, Slot 1, Slot 2, Slot 3, Slot 4, ...
Just numbers
No limit
NOT a priority
ONLY used for balancing
Slot Numbers
LOW slot numbers are the BEST slot numbers
[Animation: Server A has Slot 0, Slot 1, and Slot 4 USED. Client 2 says Hello! and is assigned Slot 2, the lowest unused slot]
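The server side of this handshake can be sketched as a table that always hands out the lowest unused slot number. This is an assumption about how such a table might be kept, not the actual Boxcar server code; `SlotTable`, `assign`, and `release` are hypothetical names.

```python
import heapq


class SlotTable:
    """Hands out the lowest unused slot number to each new connection."""

    def __init__(self):
        self._free = []    # min-heap of slot numbers that were released
        self._next = 0     # smallest slot number never handed out

    def assign(self):
        # Every released slot is smaller than _next, so the heap minimum
        # (when there is one) is the lowest unused slot overall.
        if self._free:
            return heapq.heappop(self._free)
        slot = self._next
        self._next += 1
        return slot

    def release(self, slot):
        heapq.heappush(self._free, slot)


table = SlotTable()
first_five = [table.assign() for _ in range(5)]   # slots 0..4 in use
table.release(2)
table.release(3)                                  # now 0, 1, 4 are USED
hello = table.assign()                            # a new client says Hello!
```

There is no upper limit on slot numbers here, matching the "No limit" point above: the table simply grows when nothing has been released.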
[Diagram: Client 2 holds connections to Server A and Server B in Slot 0, Slot 2, Slot 12, Slot 29, Slot 30, and Slot 57]
long-lived connections
Clients are greedy
205 MINE!
Want best connections
Continually look for better connections
Close worst connections
Background thread maintains the connection pool
[Animation: Client 2 holds Slots 0, 2, 12, 29, 30, and 57. It is offered Slot 17, which beats its worst connection, so it closes Slot 57 and keeps Slot 17. The pool becomes Slots 0, 2, 12, 17, 29, and 30]
Continues forever
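One step of that maintenance loop might look like the sketch below. It assumes the pool is a sorted list of slot numbers and ignores which server each slot belongs to; `ConnectionPool` and `offer` are hypothetical names, and a real client would call `offer` repeatedly from its background thread.

```python
import bisect


class ConnectionPool:
    """Fixed-size pool that greedily keeps the lowest slot numbers it has seen."""

    def __init__(self, size, slots=()):
        self.size = size
        self.slots = sorted(slots)

    def offer(self, slot):
        """Consider a freshly opened connection.

        Returns the slot number of the connection that gets closed,
        or None if the pool had room and nothing was closed.
        """
        if len(self.slots) < self.size:
            bisect.insort(self.slots, slot)
            return None
        worst = self.slots[-1]
        if slot < worst:
            self.slots.pop()                 # close the worst connection
            bisect.insort(self.slots, slot)
            return worst
        return slot                          # new connection is no better


pool = ConnectionPool(6, [0, 2, 12, 29, 30, 57])
closed = pool.offer(17)
```

Using the numbers from the slides, offering Slot 17 closes Slot 57; offering anything at or above the current worst slot closes the new connection instead.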
Incoming Requests
[Animation: Slots 0, 2, and 29 are ACTIVE and Slots 12, 30, and 57 are idle. GetJobs() is sent on Slot 12, the lowest-numbered idle connection]
Connections NOT established on-demand
Requests to Busy Pool
[Animation: all six connections are ACTIVE, so GetJobs() fails: ✘ ERROR!]
Sizing the pool properly is imperative!
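Request dispatch over the pool can be sketched in a few lines. The `Connection` class, `acquire` function, and `PoolBusyError` exception are illustrative names, not part of any published Boxcar API.

```python
class PoolBusyError(Exception):
    """Raised when every pooled connection is already serving a request."""


class Connection:
    def __init__(self, slot):
        self.slot = slot
        self.active = False


def acquire(pool):
    """Pick the idle connection with the lowest slot number and mark it ACTIVE."""
    idle = [conn for conn in pool if not conn.active]
    if not idle:
        # Connections are not established on demand, so a busy pool is an error.
        raise PoolBusyError("all connections are busy")
    best = min(idle, key=lambda conn: conn.slot)
    best.active = True
    return best


pool = [Connection(slot) for slot in (0, 2, 12, 29, 30, 57)]
for conn in pool:
    if conn.slot in (0, 2, 29):
        conn.active = True        # three requests already in flight

chosen = acquire(pool)            # the next GetJobs() call
```

Because a request never waits for a new connection to be opened, a pool that is too small turns bursts of traffic directly into errors.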
Servers assign numeric value to connections
Clients use the connection with the lowest numeric value to service each request
Gist Redux
Balanced load is emergent behavior
Load Balancing Simulations
[Simulation: Client X connects to Server A and Server B and is assigned slot 0 on each. Further connections are offered slot 1 on each server, which is no better than what Client X already holds]
Steady-state balance
New Clients Join
[Simulation: Client Y joins and is assigned slot 1 on each server; offers of slot 2 are not kept. Client Z joins and is assigned slot 2 on each server; offers of slot 3 are not kept. Client X holds 0 and 0, Client Y holds 1 and 1, Client Z holds 2 and 2, and Server A and Server B each have slots 0, 1, 2 in use]
Steady-state balance
Server Failure
[Simulation: Server B fails, leaving Client X with slot 0, Client Y with slot 1, and Client Z with slot 2, all on Server A. The clients refill their pools from Server A: Client Z takes slot 3, Client X takes slot 4, and Client Y takes slot 5. Server A has slots 0 through 5 in use]
Steady-state balance
Server Restored
[Simulation: Server B returns with no slots in use. Client Z is offered slot 0 on Server B; 0 < 3, so it keeps slot 0 and closes slot 3. Client Y is offered slot 1; 1 < 5, so it keeps slot 1 and closes slot 5. Client X is offered slot 2; 2 < 4, so it keeps slot 2 and closes slot 4. Server A and Server B each have slots 0, 1, 2 in use again]
Steady-state balance
Client Shutdown
[Simulation: Client Y shuts down, freeing slot 1 on both servers. Client Z, holding 0 and 2, is offered slot 1; 1 < 2, so it swaps. Client X, holding 0 and 2, is offered slot 1 on the other server; 1 < 2, so it swaps. Server A and Server B each have slots 0 and 1 in use]
Steady-state balance
Client Rejoins
[Simulation: Client Y rejoins and is assigned slot 2 on each server. Client Z and Client X keep 0 and 1, and Server A and Server B each have slots 0, 1, 2 in use]
Steady-state balance
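The join, failure, and restore scenarios can be replayed with a small simulation. It is a sketch under simplifying assumptions: every client has a pool of two, each maintenance pass probes every live server once in a fixed order, and the `Server` and `Client` classes are invented for this example. The order in which clients win particular slots therefore differs from the slides, but the per-server totals come out the same.

```python
import heapq


class Server:
    """Hands out the lowest unused slot number to each new connection."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.reset()

    def reset(self):
        # A crashed or restarted server forgets every connection.
        self.used, self._free, self._next = set(), [], 0

    def connect(self):
        if self._free:
            slot = heapq.heappop(self._free)
        else:
            slot, self._next = self._next, self._next + 1
        self.used.add(slot)
        return slot

    def disconnect(self, slot):
        self.used.remove(slot)
        heapq.heappush(self._free, slot)


class Client:
    def __init__(self, name, pool_size=2):
        self.name, self.pool_size = name, pool_size
        self.pool = []  # (slot, server) pairs

    def drop(self, server):
        # Forget connections to a server that went away.
        self.pool = [c for c in self.pool if c[1] is not server]

    def maintain(self, servers):
        # One pass of the background loop: probe every live server once.
        for server in servers:
            if not server.up:
                continue
            slot = server.connect()
            worst = max(self.pool, key=lambda c: c[0], default=None)
            if len(self.pool) < self.pool_size:
                self.pool.append((slot, server))
            elif slot < worst[0]:
                self.pool.remove(worst)
                worst[1].disconnect(worst[0])
                self.pool.append((slot, server))
            else:
                server.disconnect(slot)  # no better than what we hold


a, b = Server("A"), Server("B")
clients = [Client("X"), Client("Y"), Client("Z")]
for c in clients:                 # clients join one at a time
    c.maintain([a, b])
joined = (sorted(a.used), sorted(b.used))

b.up = False                      # Server B fails
b.reset()
for c in clients:
    c.drop(b)
    c.maintain([a, b])
failed = (sorted(a.used), sorted(b.used))

b.up = True                       # Server B is restored
for c in clients:
    c.maintain([a, b])
restored = (sorted(a.used), sorted(b.used))
```

No component ever computes a "balanced" assignment; the even split after the restore falls out of each client independently preferring lower slot numbers.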
Why does this Balance?
Connections are like running water seeking lower ground
[Animation: a chart of Slots per Server fills with Connections, which settle into the lowest available slots]
Roughly Equal Distribution
Edge cases
[Diagram: Client Z holds slots 0 and 1 and Client X holds slots 0 and 1; Server A and Server B each have slots 0 and 1 in use]
Balanced but not ideal
[Animation: after a server failure, Client X still holds slots 0 and 1 but Client Z has an EMPTY POOL!]
Resilient ✘
Fix by adding entropy
aka “Table Shaking”
Table Shaking
Servers regularly hang up on connections
Clients expect failed connections
Failures are retried on new connections
Bad configurations are less likely
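Both halves of table shaking can be sketched together. The random choice of which connection to hang up on is an assumption (the slides only say that servers do so regularly), the single shared slot namespace is a simplification, and `shake` and `send` are hypothetical names.

```python
import random


def shake(open_slots, rng):
    """Server side: hang up on one randomly chosen connection, if any exist."""
    if not open_slots:
        return None
    victim = rng.choice(sorted(open_slots))
    open_slots.remove(victim)
    return victim


def send(request, pool, open_slots):
    """Client side: try pooled connections in slot order.

    A connection the server has closed is an expected failure: it is dropped
    from the pool and the request is retried on the next connection.
    """
    for slot in sorted(pool):
        if slot in open_slots:
            return request, slot
        pool.discard(slot)
    raise ConnectionError("no open connection left in the pool")


server_side = {0, 1, 2}
victim = shake(server_side, random.Random(1))

client_pool = {0, 2}
result = send("GetJobs", client_pool, {2})   # slot 0 was hung up on
```

After a failed attempt the background maintenance would open a replacement connection, which is the step that gives a stuck configuration a chance to change.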
Table Shaking turns this
[Diagram: the “balanced but not ideal” configuration of Client Z, Client X, Server A, and Server B]
Into this
[Diagram: the connections of Client Z and Client X redistributed across Server A and Server B]
YAY!
Balancing Tricks:
Handicapping
Handicapping is Server Self-quarantine
Exploit slot number assignment
Unhealthy servers inflate slot numbers
Clients naturally avoid these servers
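A minimal sketch of handicapping, assuming the inflation is a fixed offset added to every slot number an unhealthy server hands out. The offset of 1000, and the `advertised_slot` and `pick_best` helpers, are made up for illustration; the slides do not say how large the inflation is or how health is measured.

```python
def advertised_slot(raw_slot, healthy, handicap=1000):
    """Slot number the server reports; an unhealthy server inflates it."""
    return raw_slot if healthy else raw_slot + handicap


def pick_best(offers):
    """Client side, unchanged: prefer the lowest slot number on offer."""
    return min(offers, key=lambda offer: offer[1])


offers = [
    ("A", advertised_slot(0, healthy=False)),  # unhealthy, would have been slot 0
    ("B", advertised_slot(3, healthy=True)),
]
best = pick_best(offers)
```

The client needs no health-check logic of its own: the quarantined server is still reachable, it just stops looking attractive.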
[Animation: the same chart of Slots per Server; one server is marked Unhealthy, its slot numbers rise, and Connections move to the other servers]
graceful degradation
Is Boxcar good?
Boxcar
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ?
Horizontally scalable ✓
Balanced ?
Clients are pessimistic
Failure is expected
Boxcar
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ✓
Horizontally scalable ✓
Balanced ?
Balance Connections, Not Requests
Balancing Review:
Naive Round Robin
[Animation: requests alternate between a Slow server and a Fast server, and pile up on the Slow server]
Can’t keep up
The problem was that requests (connections) piled up
Boxcar has a fixed number of connections
there’s nothing to pile up
[Animation: a Client holds one connection to the Slow Server and one to the Fast Server, in Slot 7 and Slot 9. Each request goes out on an idle connection. At the end the Slow Server has handled 2 requests and the Fast Server 4 requests]
Slow servers handle fewer requests
No overloaded servers
All requests are serviced
Load balancing is probabilistic
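The effect of a fixed pool can be shown with a toy dispatcher. Everything numeric here is an assumption: one request arrives per tick, the slow server takes 5 ticks and the fast server 1 tick per request, and the slow server is arbitrarily given the lower slot number. `dispatch` is a hypothetical helper.

```python
def dispatch(arrivals, connections):
    """Send each arriving request on the lowest-slot idle connection.

    connections maps a server name to (slot, service_time_in_ticks).
    Returns (requests handled per server, requests rejected).
    """
    busy_until = {name: 0 for name in connections}
    handled = {name: 0 for name in connections}
    rejected = 0
    for t in arrivals:
        idle = [name for name in connections if busy_until[name] <= t]
        if not idle:
            rejected += 1                    # busy pool: nothing queues up
            continue
        best = min(idle, key=lambda name: connections[name][0])
        busy_until[best] = t + connections[best][1]
        handled[best] += 1
    return handled, rejected


handled, rejected = dispatch(range(6), {"slow": (7, 5), "fast": (9, 1)})
```

The slow server is preferred whenever it is idle, yet it still ends up with fewer requests, because its single connection stays busy longer and cannot accept more work in the meantime.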
Boxcar
Simple deploys ✓
Efficient networking (Fast) ✓
Resilient ✓
Horizontally scalable ✓
Balanced ✓
Good enough for Indeed
Services well over a BILLION requests every day
Fundamental technology
Powering over 20 different services
In production since 2009
Service Oriented Architecture
Q & A