Effective SOA

Post on 20-May-2015

description

It has been observed that "A distributed system is at best a necessary evil, evil because of the extra complexity." Multiple nodes computing on inconsistent state with regular communication failures present entirely different challenges than those computer science students face in the classroom writing DFS algorithms. The past 30 years have seen some interesting theories and architectures to deal with these complexities in what we now call "cloud computing". Some researchers worked on "distributed memory" and others built "remote procedure calls". More commercially successful architectures of late have popularized ideas like the CAP theorem, distributed caches, and REST. Using examples from companies like Amazon and Google this presentation walks through some practical tips to evolve your service-oriented architecture. Google's Chubby service demonstrates how you can take advantage of CAP's "best effort availability" options and Amazon's "best effort consistency" services show the other end of the spectrum. Practical lessons learned from Lucidchart's forays into SOA share insight through quantitative analyses on how to make your system highly available.

Transcript of Effective SOA

Effective SOA: Lessons from Amazon, Google, and Lucidchart

By Derrick Isaacson

Can I get that without the bacon? - Said no one ever

http://www.food.com/photo-finder/all/bacon?photog=1072593

http://www.someecards.com/usercards/viewcard/MjAxMi03YWZiMjJiMTg3NDFhYTUy

Simplicity of Single Component Services

• I can’t remember if that getter function takes 100ns or 100ms. - Said no engineer ever
• Should I try to model this server request as a “remote procedure call”?
• 6 orders of magnitude difference!

• My front-side bus fails for only 1 second every 17 minutes! - Said no engineer ever
• 99.9% availability

• Our internet only supports .NET. - Said no engineer ever
• Do we need an SDK?

"A distributed system is at best a necessary evil, evil because of the extra complexity...An application is rarely, if ever, intrinsically distributed. Distribution is just the lesser of the many evils, or perhaps better put, a sensible engineering decision given the trade-offs involved."

-David Cheriton, Distributed Systems Lecture Notes, ch. 1

Distributed System Architectures: Does it have to be “Service-oriented”?

http://upload.wikimedia.org/wikipedia/commons/d/da/KL_CoreMemory.jpg

Distributed Memory

RPC

<I’m> <not> <making> <a> <service> <request>

<I’m> <just> <calling> <a> <procedure>

Distributed File System

mount -t nfs -o proto=tcp,port=2049 nfs-server:/ /mnt

Distributed Data Stores

• Replicated MySQL
• Mongo
• S3
• RDS
• BigTable
• Cassandra
• …

P2P

Streaming Media

Service-oriented Architectures: Social Bookmarking App

GET /profiles/123

GET /users/123

Calculate something

GET /users/123/permissions

If user can’t view profile

send 403

POST /eventFeed {new profile view}

GET /users/123/friends

GET /bookmarks?userId=123

GET /catalog/books?ids=1,3,10

Calculate something else

GET /bookmarks/trending

Send response

Lucidchart.com by Status Code

96.5% 2xx or 3xx

Lucidchart.com 1s+ Latencies

10.8% > 1s

What Happened?!? I thought SOA was supposed to make my app better!

Simple SOA Availability

<98.7%

99.5%

99.8%

99.6%

.995 * .998 * .998 * .996 = 0.987
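The product above treats the request as a chain of services that must all respond. A minimal Java sketch of that arithmetic, assuming the services fail independently; the class and method names are illustrative and not from the talk:

```java
// Sketch: availability of a request that needs every downstream service to succeed.
// Assumes failures are independent, so the individual availabilities multiply.
class SeriesAvailability {

    // Multiplies the availability of each dependency in the chain.
    static double combined(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }

    public static void main(String[] args) {
        // The four factors from the slide.
        double total = combined(0.995, 0.998, 0.998, 0.996);
        System.out.println("Combined availability: " + total);
    }
}
```

Each extra dependency can only lower the product, which is why the composed system ends up below its weakest service.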

A distributed system is at best a necessary evil

<98.7%

99.5%

99.8%

99.6%

The CAP Theorem

http://learnyousomeerlang.com/distribunomicon

The CAP Theorem

• Safety – nothing bad ever happens

• Liveness – good things happen

• Unreliability – network dis-connectivity, crash failures, message loss, Byzantine failures, slowdown, etc.

• Consistency – every response sent to a client is correct

• Availability – every request gets a response

• Partition tolerance – operating in the face of arbitrary failures

Consistency: Nothing Bad Happens

Assumption: Failures Happen

Availability Consistency

ResponseHandler<User> handler = new ResponseHandler<User>() {
    public User handleResponse(final HttpResponse response) {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? Parser.parse(entity) : null;
        } else {
            …
        }
    }
};

HttpGet userGet = new HttpGet("http://example.com/users/123");
User user = httpclient.execute(userGet, handler);

…except it doesn’t 10 of every 1000 requests

https://hc.apache.org/httpcomponents-client-4.3.x/examples.html

Works great to calculate a user!

GET /profiles/123

GET /users/123

Calculate something

GET /users/123/permissions

If user can’t view profile

send 403

POST /eventFeed {new profile view}

GET /users/123/friends

GET /bookmarks?userId=123

GET /catalog/books?ids=1,3,10

Calculate something else

GET /bookmarks/trending

Send response

Best Effort Availability - Euphemism for not always available

Best Effort Consistency - Euphemism for not always consistent

Google File System: relaxed consistency model

Throughput

Latency

Amazon Checkout

http://highscalability.com/amazon-architecture

“WOW, I really regret sacrificing consistency for availability” - said no Amazon ever

That’s $74 Billion

Hang Consistency!

Add
• Caching
• Timeouts
• Retries
• Guessing
• Anything!

Tip 1: HTTP Caching

Availability/Performance Consistency

Tip 2: HTTP Caching as Fallback
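One way to read this tip: when the upstream service fails, serve the last good response instead of an error. The sketch below is a plain in-memory stand-in for that idea, not an HTTP cache; `FallbackCache` and its method are hypothetical names, and a real deployment would more likely rely on a caching HTTP client or proxy.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: trade consistency for availability by serving stale data on failure.
class FallbackCache<K, V> {
    // Last successfully fetched value per key (never expires in this sketch).
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    // Tries the live fetch first; on an exception or a null result,
    // falls back to the last known good value, if there is one.
    Optional<V> get(K key, Callable<V> fetch) {
        try {
            V fresh = fetch.call();
            if (fresh != null) {
                lastGood.put(key, fresh);
                return Optional.of(fresh);
            }
        } catch (Exception e) {
            // Upstream failed: fall through to the stale copy.
        }
        return Optional.ofNullable(lastGood.get(key));
    }

    public static void main(String[] args) {
        FallbackCache<String, String> cache = new FallbackCache<>();
        cache.get("/users/123", () -> "alice");
        // The second call fails upstream but still returns the cached value.
        System.out.println(cache.get("/users/123", () -> {
            throw new java.io.IOException("service down");
        }));
    }
}
```

The caller may see stale data, which is the consistency cost the deck argues is usually worth paying.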

Tip 3: Retries

• Exponential backoffs & max retries
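A minimal sketch of retries with exponential backoff and a retry cap. The `Retry` helper, its parameters, and the doubling factor are assumptions chosen for illustration; the talk does not prescribe specific values.

```java
import java.util.concurrent.Callable;

// Sketch: retry a failing call, doubling the wait between attempts.
class Retry {

    // Runs the call up to (1 + maxRetries) times. Sleeps initialDelayMs after
    // the first failure and doubles the delay after each further failure.
    // Rethrows the last exception once the retry budget is used up.
    static <T> T withBackoff(Callable<T> call, int maxRetries, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = withBackoff(() -> {
            if (++calls[0] < 3) throw new java.io.IOException("flaky");
            return "ok";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retries are only safe for idempotent requests such as the GETs in the bookmarking example; the max-retries cap keeps a struggling service from being hit indefinitely.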

Tip 3: HTTP Caching Technologies

• Apache HttpComponents – HttpClient Cache
• Ehcache
• Redis
• Memcached
• CloudFront
• Akamai
• Berkeley DB
• AWS SNS (for notifying cache components of changes)

Segmenting Consistency and Availability: 1. Data Partitioning

Shopping Cart

Warehouse Inventory DB

Segmenting: 2. Operation Partitioning

Reads

Writes

Dynamo & PNUTS

Segmenting: 3. Functional Partitioning

User Service, Document Snapshots

Document Service

Segmenting: 4. Hierarchical Partitioning

Leaves

Root

http://www.slashgear.com/google-data-center-hd-photos-hit-where-the-internet-lives-gallery-17252451/

Timeouts

Stop Guessing and Just Calculate It

• Max I/O wait time = # of threads * (CONNECT_TIMEOUT + READ_TIMEOUT)
• 9 front end servers received 1900 requests in 60 seconds, 300 of them for Flickr resources (16%)
• 35 requests per server per minute
• Max 100 threads => 6,000 thread seconds in one minute
• Goal: ensure < 10% of thread seconds spent blocked on Flickr I/O
• 35 requests * (CONNECT_TIMEOUT + READ_TIMEOUT) < 600
• CONNECT_TIMEOUT + READ_TIMEOUT < 17 seconds
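The calculation above can be written as a small function so it is easy to rerun with different traffic numbers. This is a sketch using the slide's figures; the class, method, and parameter names are hypothetical.

```java
// Sketch: derive an upper bound for CONNECT_TIMEOUT + READ_TIMEOUT from a
// budget of thread-seconds that may be spent blocked on one dependency.
class TimeoutBudget {

    // threads * windowSeconds is the total thread-seconds available per server.
    // blockedFraction is the share of that total allowed to block on the dependency.
    // requestsPerWindow is how many dependency requests one server makes per window.
    static double maxTimeoutSeconds(int threads, int windowSeconds,
                                    double blockedFraction, int requestsPerWindow) {
        double threadSeconds = (double) threads * windowSeconds;
        double blockedBudget = threadSeconds * blockedFraction;
        return blockedBudget / requestsPerWindow;
    }

    public static void main(String[] args) {
        // 100 threads, 60 s window, 10% budget, 35 Flickr requests per server per minute.
        double limit = maxTimeoutSeconds(100, 60, 0.10, 35);
        System.out.println("CONNECT_TIMEOUT + READ_TIMEOUT must stay below " + limit + " s");
    }
}
```

With the slide's inputs the bound is 600 / 35, a little over 17 seconds, which matches the last bullet.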

(Timeline: TCP Connect, Send Request, Block on socket read, Read response. CONNECT_TIMEOUT bounds the connect phase; READ_TIMEOUT bounds the blocked socket read.)
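A sketch of setting both timeouts explicitly on an outgoing request, using the JDK's `HttpURLConnection` rather than the Apache HttpClient shown earlier. The 2 s and 14 s values are assumptions picked only to stay under the 17 second bound computed above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Sketch: never issue a service request with the default (unbounded) timeouts.
class TimeoutConfig {
    // Assumed values; their sum (16 s) stays under the 17 s budget.
    static final int CONNECT_TIMEOUT_MS = 2_000;
    static final int READ_TIMEOUT_MS = 14_000;

    // Prepares a connection with both timeouts set. openConnection() does not
    // touch the network; the timeouts apply once the request is actually sent.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // bounds the TCP connect
        conn.setReadTimeout(READ_TIMEOUT_MS);       // bounds each blocking socket read
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = open("http://example.com/users/123");
        System.out.println(conn.getConnectTimeout() + " ms connect, "
                + conn.getReadTimeout() + " ms read");
    }
}
```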

Best Effort Consistency System

99.9%

99.5%

99.8%

99.6%

Wow, my pizza has too much cheese and toppings - Said no one ever

http://upload.wikimedia.org/wikipedia/commons/6/60/Pizza_Hut_Meat_Lover's_pizza_3.JPG

“WOW, my system has too much caching, timeouts, and availability.” - said no one ever

Questions?

golucid.co

http://www.slideshare.net/DerrickIsaacson