Building Reliable Systems From Unreliable Parts


You might think that the problems at large scale are different from the problems at small scale, but it's all failure all the time, and it is only worse at large scale.

Plan for it.

All systems fail

Tell the story about the failure last week when a developer pushed a small Java warmup script and took out all of the Netflix API servers.

Jonah Horowitz

(Site Reliability Engineer at Netflix and elsewhere)

Home built BBS in 1990
Some NOC/Helpdesk work
Walmart.com in 2000
BSEE from the Univ of Cincinnati
Music Startup in 2005
Telecom Startup in 2007
Advertising companies
Netflix

Talk about Netflix scale: 100k servers, 80M users, 30TB/s network traffic, 800 microservices, 1500 engineers.

Chaos is your friend

Talk about Chaos Monkey, Chaos Kong
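
The idea behind this kind of chaos testing can be sketched in a few lines: kill a random instance on purpose, then check that requests still succeed. This is only an illustration, not the actual Chaos Monkey tool; `Instance`, `chaos_kill`, and `call_with_failover` are hypothetical names invented for the sketch, and the "servers" are plain in-memory objects.

```python
import random


class Instance:
    """A hypothetical server instance that can be up or down."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def chaos_kill(instances, rng):
    """Terminate one randomly chosen live instance and return it (or None)."""
    live = [i for i in instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def call_with_failover(instances, request):
    """Try each instance in turn until one answers."""
    for inst in instances:
        try:
            return inst.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instances left")


pool = [Instance(f"api-{n}") for n in range(3)]
rng = random.Random(42)  # seeded only so the sketch is repeatable
victim = chaos_kill(pool, rng)
response = call_with_failover(pool, "GET /titles")
```

If the request still succeeds after the kill, the pool tolerates losing one instance. If it does not, it is better to find that out during a planned test than during an outage.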

Stateless services are awesome

Most of the 800 microservices at Netflix are stateless, which allows for failure.
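
A rough sketch of why statelessness allows for failure, assuming all user data lives in a shared store outside the service instance. `SharedStore` and `StatelessService` are hypothetical names, and an in-memory dict stands in for a real external database.

```python
class SharedStore:
    """Stand-in for an external data store (hypothetical; an in-memory dict here)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class StatelessService:
    """Keeps no per-user data on the instance; everything goes through the store."""

    def __init__(self, store):
        self.store = store

    def add_to_queue(self, user, title):
        queue = list(self.store.get(user, []))
        queue.append(title)
        self.store.put(user, queue)
        return queue


store = SharedStore()
instance_a = StatelessService(store)
instance_b = StatelessService(store)

instance_a.add_to_queue("user-1", "Title A")
del instance_a  # simulate the first instance dying
queue = instance_b.add_to_queue("user-1", "Title B")
```

Because neither instance holds the queue itself, the second instance picks up exactly where the first one left off.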

Store state somewhere

Globally replicated Cassandra database rings, with a massive number of nodes, but you should have 2 copies of your database.
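
A toy illustration of the "2 copies" point: write to both copies, and read from whichever one still answers. This is not how Cassandra replicates data; `Replica`, `write_all`, and `read_any` are hypothetical names, and the sketch ignores how a copy that missed writes would catch up.

```python
class Replica:
    """A hypothetical database copy that can be up or down."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(self.name)
        self.data[key] = value

    def read(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.data[key]


def write_all(replicas, key, value):
    """Write to every replica that is up; return how many accepted the write."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)
            acks += 1
        except ConnectionError:
            pass
    return acks


def read_any(replicas, key):
    """Return the value from the first replica that answers."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ConnectionError:
            continue
    raise RuntimeError("no replica available")


replicas = [Replica("copy-1"), Replica("copy-2")]
acks = write_all(replicas, "profile:42", "kids")
replicas[0].up = False  # lose one copy
value = read_any(replicas, "profile:42")
```

With a single copy, the same failure would have made the data unavailable.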

Repair Automatically

Never have an engineer do something that can be done automatically. Computers are better at pushing buttons than you are.
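
One way to picture automatic repair is a loop that checks each node and replaces the unhealthy ones without a human in the path. This is a minimal sketch under assumed names: `repair`, `is_healthy`, and `replace` are hypothetical, and the health data is a hard-coded dict rather than a real health check.

```python
def repair(fleet, is_healthy, replace):
    """Replace every unhealthy node in place; return the nodes that were replaced."""
    replaced = []
    for idx, node in enumerate(fleet):
        if not is_healthy(node):
            fleet[idx] = replace(node)
            replaced.append(node)
    return replaced


# Hypothetical health data standing in for a real health check.
health = {"db-1": True, "db-2": False, "db-3": True}


def is_healthy(node):
    return health.get(node, False)


def replace(node):
    # A real system would launch a new server here; this just records a fresh one.
    new_node = node + "-new"
    health[new_node] = True
    return new_node


fleet = ["db-1", "db-2", "db-3"]
replaced = repair(fleet, is_healthy, replace)
second_pass = repair(fleet, is_healthy, replace)
```

Running the loop a second time replaces nothing, which is the property that makes it safe to run on a schedule.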

Talk about rebootageddon: zero downtime even though 1/3 of our Cassandra servers were rebooted over 48 hours.

Even in open source projects.

Culture is important

Jonah Horowitz
Site Reliability Engineer

Netflix lawyers didn't approve my talk, so everything I said was my own opinion.

Speakers here were really inspiring.