Post on 20-May-2015
Effective SOA: Lessons from Amazon, Google, and Lucidchart
By Derrick Isaacson
Can I get that
without the bacon?
Said no one ever
http://www.food.com/photo-finder/all/bacon?photog=1072593
http://baconipsum.com/?paras=1&type=all-meat&start-with-lorem=1
http://www.someecards.com/usercards/viewcard/MjAxMi03YWZiMjJiMTg3NDFhYTUy
Simplicity of Single Component Services
• I can’t remember if that getter function takes 100ns or 100ms. - Said no engineer ever
• Should I try to model this server request as a “remote procedure call”?
• 6 orders of magnitude difference!
• My front-side bus fails for only 1 second every 17 minutes! - Said no engineer ever
• 99.9% availability
• Our internet only supports .NET. - Said no engineer ever
• Do we need an SDK?
"A distributed system is at best a necessary evil, evil because of the extra complexity...An application is rarely, if ever, intrinsically distributed. Distribution is just the lesser of the many evils, or perhaps better put, a sensible engineering decision given the trade-offs involved."
-David Cheriton, Distributed Systems Lecture Notes, ch. 1
Distributed System Architectures
Does it have to be “Service-oriented”?
http://upload.wikimedia.org/wikipedia/commons/d/da/KL_CoreMemory.jpg
Distributed Memory
RPC
<I’m> <not> <making> <a> <service> <request>
<I’m> <just> <calling> <a> <procedure>
Distributed File System
mount -t nfs -o proto=tcp,port=2049 nfs-server:/ /mnt
Distributed Data Stores
• Replicated MySQL
• Mongo
• S3
• RDS
• BigTable
• Cassandra…
P2P
Streaming Media
Service-oriented Architectures
Social Bookmarking App
GET /profiles/123
GET /users/123
Calculate something
GET /users/123/permissions
If user can’t view profile
send 403
POST /eventFeed {new profile view}
GET /users/123/friends
GET /bookmarks?userId=123
GET /catalog/books?ids=1,3,10
Calculate something else
GET /bookmarks/trending
Send response
Lucidchart.com by Status Code
96.5% 2xx or 3xx
Lucidchart.com 1s+ Latencies
10.8% > 1s
What Happened?!?
I thought SOA was supposed to make my app better!
Simple SOA Availability
<98.7%
99.5%
99.8%
99.6%
.995 * .998 * .998 * .996 = 0.987
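The product above can be reproduced with a few lines of Java. This is a minimal sketch that assumes every dependency must succeed for the request to succeed and that failures are independent; the class and method names are hypothetical, not from the talk.

```java
import java.util.Locale;

// Hypothetical helper: availability of a request that needs every
// dependency to answer. Assumes independent failures.
final class CompositeAvailability {

    /** Multiplies per-service availabilities, each a fraction in [0, 1]. */
    static double serial(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }

    public static void main(String[] args) {
        // The four factors from the slide.
        double total = serial(0.995, 0.998, 0.998, 0.996);
        System.out.println(String.format(Locale.ROOT, "%.3f", total)); // prints 0.987
    }
}
```

Every extra synchronous dependency adds another factor below 1, so the composite figure can only go down as the call chain grows.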
A distributed system is at best a necessary evil
<98.7%
99.5%
99.8%
99.6%
The CAP Theorem
http://learnyousomeerlang.com/distribunomicon
The CAP Theorem [1]
• Safety – nothing bad ever happens
• Liveness – good things happen
• Unreliability – network dis-connectivity, crash failures, message loss, Byzantine failures, slowdown, etc.
• Consistency – every response sent to a client is correct
• Availability – every request gets a response
• Partition tolerance – operating in the face of arbitrary failures
Consistency: Nothing Bad Happens
Assumption: Failures Happen
Availability Consistency
ResponseHandler<User> handler = new ResponseHandler<User>() {
    public User handleResponse(final HttpResponse response) {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? Parser.parse(entity) : null;
        } else {
            …
        }
    }
};

HttpGet userGet = new HttpGet("http://example.com/users/123");
User user = httpclient.execute(userGet, handler);
…except it doesn’t 10 of every 1000 requests
https://hc.apache.org/httpcomponents-client-4.3.x/examples.html
Works great to calculate a user!
GET /profiles/123
GET /users/123
Calculate something
GET /users/123/permissions
If user can’t view profile
send 403
POST /eventFeed {new profile view}
GET /users/123/friends
GET /bookmarks?userId=123
GET /catalog/books?ids=1,3,10
Calculate something else
GET /bookmarks/trending
Send response
Best Effort Availability: euphemism for not always available
Best Effort Consistency: euphemism for not always consistent
Google File System: relaxed consistency model
Throughput
Latency
Amazon Checkout
http://highscalability.com/amazon-architecture
“WOW, I really regret sacrificing consistency for availability”
- said no Amazon ever
That’s $74 Billion
Hang Consistency!
Add
• Caching
• Timeouts
• Retries
• Guessing
• Anything!
Tip 1: HTTP Caching
Availability/Performance Consistency
Tip 2: HTTP Caching as Fallback
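One way to picture caching as a fallback is a wrapper that remembers the last good response and serves it when the live call fails. The sketch below is an in-memory analogy only, not an HTTP cache; `FallbackCache` and its `get` method are hypothetical names, and a real deployment would more likely lean on an HTTP caching layer.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: prefer a fresh value, fall back to the last good one.
final class FallbackCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    /** Runs fetch; on failure returns the last successful value for key, if any. */
    V get(K key, Callable<V> fetch) throws Exception {
        try {
            V fresh = fetch.call();
            if (fresh != null) {
                lastGood.put(key, fresh); // remember for later failures
            }
            return fresh;
        } catch (Exception e) {
            V stale = lastGood.get(key);
            if (stale == null) {
                throw e; // nothing cached, so the failure surfaces
            }
            return stale; // possibly outdated, but the caller gets an answer
        }
    }
}
```

This trades consistency for availability: the caller may see an old value, which is acceptable for data such as trending bookmarks and less so for permissions.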
Tip 3: Retries
• Exponential backoffs & max retries
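A retry loop with exponential backoff and a retry cap might look like the following. It is a sketch under assumptions: `Retry.withBackoff` is a hypothetical helper, the delay simply doubles after each failure, and every exception is treated as retryable, which is only reasonable for idempotent requests such as GETs.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retry a call, doubling the wait after each failure.
final class Retry {

    /**
     * Tries call up to maxRetries + 1 times in total.
     * Sleeps initialDelayMs after the first failure, then 2x, 4x, ...
     */
    static <T> T withBackoff(Callable<T> call, int maxRetries, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }
}
```

For example, `Retry.withBackoff(() -> fetchUser(123), 3, 100)` would wait 100 ms, 200 ms, and 400 ms between attempts, where `fetchUser` stands in for any idempotent call. Production code often adds random jitter and a cap on the delay.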
Tip 3: HTTP Caching Technologies
• Apache HttpComponents – HttpClient Cache
• Ehcache
• Redis
• Memcached
• CloudFront
• Akamai
• Berkeley DB
• AWS SNS (for notifying cache components of changes)
Segmenting Consistency and Availability
1. Data Partitioning
Shopping Cart
Warehouse Inventory DB
Segmenting
2. Operation Partitioning
Reads
Writes
Dynamo & PNUTS
Segmenting
3. Functional Partitioning
User Service, Document Snapshots
Document Service
Segmenting
4. Hierarchical Partitioning
Leaves
Root
http://www.slashgear.com/google-data-center-hd-photos-hit-where-the-internet-lives-gallery-17252451/
Timeouts
Stop Guessing and Just Calculate It
• Max I/O wait time = # of threads * (CONNECT_TIMEOUT + READ_TIMEOUT)
• 9 front end servers received 1900 requests in 60 seconds, 300 of them for Flickr resources (16%).
• 35 requests per server per minute
• Max 100 threads, => 6,000 thread seconds in one minute
• Goal: ensure < 10% of thread seconds spent blocked on Flickr I/O
• 35 requests * (CONNECT_TIMEOUT + READ_TIMEOUT) < 600
• CONNECT_TIMEOUT + READ_TIMEOUT < 17 seconds
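The budget above is plain arithmetic, so it can be recomputed whenever traffic or thread counts change. The sketch below uses the slide's numbers; `TimeoutBudget` and its parameter names are hypothetical.

```java
import java.util.Locale;

// Hypothetical calculator for the combined connect + read timeout budget.
final class TimeoutBudget {

    /**
     * Largest combined timeout (seconds) such that, even if every dependent
     * request in the window times out, at most blockedFraction of the
     * available thread-seconds is spent blocked on that dependency.
     */
    static double maxTimeoutSeconds(int threads, int windowSeconds,
                                    double blockedFraction, int requestsPerWindow) {
        double threadSeconds = (double) threads * windowSeconds; // 100 * 60 = 6,000
        double budget = threadSeconds * blockedFraction;         // 10% of that = 600
        return budget / requestsPerWindow;                       // 600 / 35
    }

    public static void main(String[] args) {
        double limit = maxTimeoutSeconds(100, 60, 0.10, 35);
        System.out.println(String.format(Locale.ROOT, "%.1f", limit)); // prints 17.1
    }
}
```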
TCP Connect
Send Request
Block on socket read
Read response
CONNECT_TIMEOUT READ_TIMEOUT
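Both timeouts are set explicitly on the client. The sketch below uses the JDK's `HttpURLConnection` rather than the Apache HttpClient shown earlier, so that it runs without extra dependencies; the 2 s / 15 s split of the 17-second budget is an assumption for illustration, and `TimeoutConfig` is a hypothetical name.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: prepare a connection with explicit timeouts.
final class TimeoutConfig {

    /** Builds a connection object; no network I/O happens until it is used. */
    static HttpURLConnection open(String url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // bounds the TCP connect phase
        conn.setReadTimeout(readTimeoutMs);       // bounds each blocking socket read
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Assumed split of the budget: 2 s to connect, 15 s to read.
        HttpURLConnection conn = open("http://example.com/users/123", 2_000, 15_000);
        System.out.println(conn.getConnectTimeout() + " ms / " + conn.getReadTimeout() + " ms");
    }
}
```

Without explicit values, `HttpURLConnection` uses a timeout of 0, which means waiting indefinitely.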
Best Effort Consistency System
99.9%
99.5%
99.8%
99.6%
Wow, my pizza has too much
cheese and toppings
Said no one ever
http://upload.wikimedia.org/wikipedia/commons/6/60/Pizza_Hut_Meat_Lover's_pizza_3.JPG
“WOW, my system has too much caching, timeouts, and availability.”
- said no one ever
Questions?
golucid.co
http://www.slideshare.net/DerrickIsaacson
References
1. Perspectives on the CAP Theorem
2. Bacon Ipsum
3. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web
4. The Google File System
5. Big Table
6. Amazon Architecture References
7. Apache HttpComponents
8. Apache HttpClient Cache
9. Ehcache