Why Did We Think Large Scale Distributed Systems Would be Easy? - PuppetConf 2013
Why did we think large scale distributed systems would be easy?
Gordon Rowell
PuppetConf San Francisco 2013
Background
Site Reliability Engineering runs many services. The same rules always apply:
● Make the service scale
● Make the deployment consistent
● Understand all layers of the system
● Monitor everything
● Plan for failure
● Break things, under controlled conditions
Scaling is fun
We don't deploy "a server"
● Servers break, power fails
● Clients/DNS need to be reconfigured

We don't deploy "a cluster"
● Networks break, servers break, power fails
● Clients/DNS need to be reconfigured

We deploy redundant clusters
● Attempt to send clients to the nearest serving cluster
● Anycast allows for unified client configuration
But client DoS is not
Poorly written code...
● on a small number of clients...
● is annoying

Poorly written code...
● on a huge number of clients...
● can cause serious infrastructure pain

Write good code and stage your releases
● Work with the service owners
● Stage rollouts, allow soak time
● Have a rollback plan for clients, and test it
● Have DoS limits for services, and test them (see the sketch below)
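As a rough illustration of a server-side DoS limit, here is a minimal token-bucket rate limiter in Python. The class name, rate, and burst values are assumptions for this sketch, not anything prescribed in the talk; a real service would keep one bucket per client identity and tune the numbers from measured capacity.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: refuse requests once a
    client exceeds its allowed rate (illustrative values)."""

    def __init__(self, rate_per_sec=5.0, burst=10):
        self.rate = rate_per_sec   # steady-state requests/second allowed
        self.capacity = burst      # short bursts tolerated up to this size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # over limit: reject (or queue) this request

# One bucket per client, e.g. keyed by source IP.
bucket = TokenBucket()
for _ in range(20):
    print("served" if bucket.allow() else "rejected")
```

The point of testing such limits, per the slide, is that the rejection path itself must be cheap, or the limiter becomes another way for a huge client population to hurt the service.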
Load balancing is fun
Do you have enough capacity?
● How many backends do you need?
● What happens if half of your backends lose power?
● What about when half are already out for repairs?

How do you send clients to the right cluster?
● Client configuration
● DNS round-robin (simple global load balancing; see the sketch below)
● DNS views (give the best answer for the client's IP)
● Anycast (portable IP, routed to the "nearest" cluster)
● Consider: DNS views plus Anycast
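To make the round-robin option concrete, here is a minimal client-side sketch in Python: resolve a service name, collect all returned addresses, and spread connections across them. The hostname is a placeholder assumption; a real round-robin name would return one A record per serving cluster.

```python
import random
import socket

def pick_backend(hostname, port):
    """Resolve a round-robin DNS name and pick one address at random.
    With multiple A records, this spreads clients across clusters."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses)

# "service.example.com" is a placeholder name for this sketch.
print(pick_backend("service.example.com", 443))
```

Note what this simple scheme cannot do: it knows nothing about which cluster is nearest or healthiest, which is why the slide points toward DNS views and anycast for anything beyond basic load spreading.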
But global outages are not
Monitor everything
● Health check failures bring down your service
● ...by design

Test everything
● You should expect (and test) data center outages
● A global outage can ruin your day
● Cascading failures are unpleasant

Learn from outages
● Write postmortems
● Focus on the facts!
● What went wrong, and what can be done better?
● A postmortem is not about blame
Thundering herds are not
For Puppet
● "Lots" of Mac desktops and laptops
● "Lots" of Ubuntu desktops, laptops and servers
● "Some" others

What if they all want to do a puppet run?
● What about every hour?
● What about every five minutes?

Randomize your cron jobs! (and test it; see the sketch below)
How can you shed load on the server?
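As a minimal sketch of randomized scheduling, the snippet below derives a stable per-host offset from a hash of the hostname, so each machine sleeps a different (but consistent) amount before its puppet run instead of stampeding the server on the hour. The one-hour window is an illustrative assumption, and the snippet assumes a puppet binary on PATH; Puppet's own fqdn_rand() function serves the same purpose inside manifests.

```python
import hashlib
import socket
import subprocess
import time

def splay_seconds(window=3600):
    """Stable pseudo-random offset in [0, window), derived from the
    hostname, so every host picks a different second of the hour."""
    digest = hashlib.sha256(socket.gethostname().encode()).hexdigest()
    return int(digest, 16) % window

# Sleep for this host's offset, then run the agent. Invoked hourly
# from cron, this spreads the fleet's runs across the whole hour.
time.sleep(splay_seconds())
subprocess.run(["puppet", "agent", "--test"], check=False)
```

Hashing the hostname rather than calling random() matters: the offset survives reboots, so the load stays evenly spread instead of reshuffling (and possibly re-clumping) on every run.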
Anycast is fun
Anycast is "coarse-grain" load balancing
● Routes traffic to the "nearest" serving cluster

Networks break
● Physical issues
● Routing issues
● Configuration issues
● Load balancer bugs
Anycast monitoring is hard
Anycast directed to one site is not fun
All clients could be sent to the same cluster
● Be ready for that
● Can a single cluster handle worldwide traffic?
● What do you do if it can't?

Have a mitigation strategy to shed load
● Include load calculations early in health checks (see the sketch below)
● Consider DNS views to redirect some traffic
● Drop traffic if you have to
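Here is a minimal sketch of a load-aware health check, assuming a Unix-style load average and an arbitrary threshold of 0.8 per core: the endpoint reports unhealthy before the machine saturates, so the load balancer can drain it early rather than waiting for hard failure. The port and threshold are illustrative assumptions.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_LOAD_PER_CORE = 0.8   # illustrative threshold; tune per service

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 1-minute load average, normalized by core count (Unix only).
        load = os.getloadavg()[0] / (os.cpu_count() or 1)
        if load < MAX_LOAD_PER_CORE:
            self.send_response(200)   # healthy: keep sending traffic
        else:
            self.send_response(503)   # overloaded: shed this backend
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```

This is the "by design" point from the monitoring slide in miniature: a failing health check deliberately removes capacity, so the threshold has to be tested against the question above, namely whether the remaining clusters can absorb what this one sheds.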
Diversity is good...for people
Be ruthless against platform diversity
If you can’t automate it, don’t do it
● “Could we bring up another 50 today, please?”
● “That backend was just a little different and...oops”

Anycast helps you be consistent
● Traffic could go anywhere

Every OS upgrade is a time to refactor and clean