SRE - drupal day aveiro 2016

Ricardo Amaro, September 2016: Implementing Site Reliability Engineering

Transcript of SRE - drupal day aveiro 2016

Page 1: SRE - drupal day aveiro 2016

Ricardo Amaro, September 2016

Implementing Site Reliability Engineering

Page 2: SRE - drupal day aveiro 2016

Who am I?

@Drupal

@ricardoamaro

Portugal

Lisbon

Drupal Community

Family

+8 years Drupal

90’s Linux Adopter

5 years at Acquia

Site Reliability Engineer - Senior Tier 2 Ops

https://drupal.org/user/666176

Page 3: SRE - drupal day aveiro 2016

About Acquia Metrics

○ Acquia Cloud:
○ # of Instances (17,200+)
○ # of Production Sites (54,000+) (incl. some of the biggest sites in the world)
○ # of API Calls (3,000+ per sec)
○ # of Availability Zones (20+)
○ # of World Regions (8)

Page 4: SRE - drupal day aveiro 2016

We will talk about
A brief summary inspired by Google’s S.R.E. book

○ What is S.R.E.?
○ Tenets of S.R.E.
○ Reliability & Toil
○ Error budget - keeping the Service Level Objective (SLO)
○ Development & Operations
○ Monitoring and Being On-Call
○ Postmortem culture - Learning from failure

Page 5: SRE - drupal day aveiro 2016


What is S.R.E.?

Page 6: SRE - drupal day aveiro 2016

Site Reliability Engineering

➔ Term coined at Google in 2003, when Ben Treynor was hired to run “production” and ended up “applying software engineering to an operations function”.

Page 7: SRE - drupal day aveiro 2016

Site Reliability Engineering

➔ SRE is taken seriously by major companies: Microsoft, Apple, Amazon

Page 8: SRE - drupal day aveiro 2016

Site Reliability Engineering

SREs are engineers who...

➔ Apply the principles of computer science and engineering to design and develop large, distributed computing systems.
➔ Write software for those systems alongside product developers.
➔ Build all additional pieces those systems need, like backups and load balancing.
➔ Reuse old solutions for new problems.

Page 9: SRE - drupal day aveiro 2016

Site Reliability Engineering

“Reliability is the most fundamental feature of any product.”
Ben Treynor, Google’s VP for 24/7 Operations

Page 10: SRE - drupal day aveiro 2016


DevOps & S.R.E.

DevOps is a practice, which was coined around 2008, that encompasses automation of manual tasks, continuous integration and continuous delivery. It applies to a wide audience of companies whereas SRE might be considered a subset of DevOps that possesses additional skill sets.

Source: https://en.wikipedia.org/wiki/Site_reliability_engineering

Page 11: SRE - drupal day aveiro 2016


Tenets of S.R.E.

Page 12: SRE - drupal day aveiro 2016

Tenets of SRE

1. Ensuring a Durable Focus on Engineering
2. Pursuing Maximum Change Velocity Without Violating SLOs
3. Monitoring
4. Emergency Response
5. Change Management
6. Demand Forecasting and Capacity Planning
7. Provisioning
8. Efficiency and Performance

Page 13: SRE - drupal day aveiro 2016

How to achieve S.R.E.
Treynor’s Action items

➔ Hire only coders
➔ Have a Service Level Objective (SLO) for your service
➔ Measure and report performance against SLOs
➔ Use Error Budgets and gate launches on them
➔ Have a common staffing pool for SRE and DEV
➔ Excess Ops work overflows to the DEV team
➔ Cap SRE operational load at 50% and share 5% with the DEV team
➔ On-call teams of at least 8 people at one location, or 6 at each location, for each product
➔ Maximum of 2 events per on-call shift
➔ Postmortem for every event
➔ Postmortems are BLAMELESS and focus on process and technology, not people

Page 14: SRE - drupal day aveiro 2016


Reliability & Toil

Page 15: SRE - drupal day aveiro 2016

What is the most important feature of a product?

The latest feature, or that the product works?

Page 16: SRE - drupal day aveiro 2016

How about the “503” feature?

The most important thing is that the product works!

Page 17: SRE - drupal day aveiro 2016

The 80’s Waterfall software delivery model

Operations @customer
➔ *Provisioning
➔ *Installing
➔ *Upgrading
➔ *Maintaining
➔ *Backups/Restore
➔ *Scaling

Source: Wikipedia

Page 18: SRE - drupal day aveiro 2016

Then came the web...

● Software as a Service
● Platform as a Service
● Cloud computing
● ...

➔ Operations overhead not on the customer side
➔ Features could now be delivered faster
➔ Customer feedback important for product improvements

Diagram labels: Product, Development (Ship Features), Operations, Users

Page 19: SRE - drupal day aveiro 2016

Opposing incentives in conflict

Dev objectives:
➔ Ship new features
➔ Launch new products

Ops objectives:
➔ Reliability & Availability
➔ Customer success

Page 20: SRE - drupal day aveiro 2016

The problem: Toil ("exhausting labour")

➔ Manual
➔ Repetitive
➔ Automatable
➔ Tactical (Unplanned work)
➔ No enduring value
➔ Scales linearly with service growth

(not just “work I don’t like to do.”)

Page 21: SRE - drupal day aveiro 2016

An Old Solution to Toil

● Scale with bodies
In the old operations model, you throw people at a reliability problem and keep pushing (sometimes for a year or more) until the problem either goes away or blows up in your face.

Page 22: SRE - drupal day aveiro 2016

Workload/Toil over time

As your business grows, workload tends to infinity

Chart: workload/toil over time (x axis: time; y axis: customers/traffic)

● Cap Ops Workload
As your business grows, you need to reduce manual labor in order to continue delivering features. Put a 50% cap on Ops work and leave the remaining SRE team time for writing code and reducing toil.

Page 23: SRE - drupal day aveiro 2016

Why less Toil is Better
S.R.E. - A modern solution

Google’s example
➔ Keep operational work (i.e., toil) below 50% of each SRE’s time
➔ At least 50% of each SRE’s time is spent on:
  ◆ engineering project work to reduce toil
  ◆ adding service features - improving reliability, performance, utilization
➔ Improves career planning for the SRE
➔ Improves morale in the organization
➔ An SRE team can easily devolve into an Ops team if the 50% target is broken.
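The 50% target can be tracked with very little tooling. The sketch below is only an illustration: the `WeekLog` record, the engineer names and the hours are hypothetical, and it assumes each SRE's time is logged as either toil or project work.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # cap on operational work per SRE, taken from the slide


@dataclass
class WeekLog:
    """Hypothetical weekly time log for one SRE, in hours."""
    engineer: str
    toil_hours: float
    project_hours: float

    def toil_fraction(self) -> float:
        """Share of logged time that went to toil (0.0 when nothing is logged)."""
        total = self.toil_hours + self.project_hours
        return self.toil_hours / total if total else 0.0


def over_cap(logs: list[WeekLog], cap: float = TOIL_CAP) -> list[str]:
    """Return the engineers whose toil share exceeds the cap."""
    return [log.engineer for log in logs if log.toil_fraction() > cap]


# Made-up sample data for illustration only.
logs = [
    WeekLog("ana", toil_hours=12, project_hours=28),  # 30% toil
    WeekLog("rui", toil_hours=26, project_hours=14),  # 65% toil
]
print(over_cap(logs))  # ['rui']
```

Anyone returned by `over_cap` is a signal that the team is drifting toward a pure Ops role.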

Page 24: SRE - drupal day aveiro 2016

S.R.E. - A modern solution
DEV + OPS

➔ This conflict is not inevitable
➔ The solution is: Error Budgets!
➔ Everyone agrees on an Error Budget (as we will explain next)
➔ SRE only prevents releases or launches if the Error Budget is exceeded.

Page 25: SRE - drupal day aveiro 2016

Error budgets: keeping the SLOs

Page 26: SRE - drupal day aveiro 2016

Error Budget

What is an Error Budget?

The business or the product establishes Service Level Objectives (SLOs) for the system, based on Service Level Indicators (SLIs) such as error rate, availability or latency.

Example: A 99.9% availability SLO means that the service can be unavailable 0.1% of the time; that 0.1% is the error budget.
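To make the 99.9% example concrete, the budget can be converted into minutes of allowed downtime. This is a minimal sketch; the 30-day window and the function names are assumptions for illustration, not something the slides specify.

```python
def error_budget(slo: float) -> float:
    """Fraction of the window during which the service may be unavailable."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of downtime over the window.

    The 30-day default window is an assumption made for this example.
    """
    return error_budget(slo) * window_days * 24 * 60


print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
```

So a 99.9% SLO over 30 days leaves roughly 43 minutes of unavailability to spend.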

Page 27: SRE - drupal day aveiro 2016

How to obtain and use the Error Budget?

➔ 100% is the wrong reliability target for basically everything.
➔ Set an SLO that acknowledges the trade-off and leaves an error budget.
➔ The error budget can be spent on anything: launching features, etc.
➔ The error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors.
➔ The goal of the SRE team isn’t “zero outages”: SRE and product devs are incentive-aligned to spend the error budget to get maximum feature velocity.
➔ Out of budget? No problem. Do more testing between releases.
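Gating launches on the error budget can be sketched as a small check. The example assumes a request-based availability SLI (failed requests over total requests); the function names and the sample numbers are hypothetical.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent (negative when overspent).

    Assumes a request-based SLI: the budget is the number of requests
    that are allowed to fail in the measurement window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must leave a non-zero error budget")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def may_launch(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Allow a release only while some error budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0.0


# With a 99.9% SLO and 1,000,000 requests, about 1,000 failures are allowed.
print(may_launch(0.999, 1_000_000, 400))   # True: about 60% of the budget is left
print(may_launch(0.999, 1_000_000, 1200))  # False: the budget is overspent
```

Because the decision comes from the measured numbers, neither Dev nor SRE has to argue the case for or against a release.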

Page 28: SRE - drupal day aveiro 2016

Error Budget
A Self-regulating mechanism

➔ This gives developers an incentive to value stability (not just change).
➔ And gives SREs control that drives them to permit change (not just stability).
➔ It forces decisions based on metrics: not politics, not feelings, just data.

Page 29: SRE - drupal day aveiro 2016


Development & Operations

Page 30: SRE - drupal day aveiro 2016

How are Development & Operations teams organized?

➔ Development and SRE teams share a single staffing pool
  ◆ If all is reliable, Devs are rewarded with teammates
  ◆ If Ops is overloaded, SREs are contracted to support code

Now tell me… Why should I hire you?

Page 31: SRE - drupal day aveiro 2016

Development & Operations

➔ SREs are developer/sys-admin hybrids
  ◆ They perform more Dev work as things become stable

Systems, code… Are you able to cook also?

Page 32: SRE - drupal day aveiro 2016

Development & Operations

➔ SREs can only spend up to 50% of their time on ops work
➔ If operational load exceeds 50%, the ops work overflows to Dev
➔ Allow them to move to other projects
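The overflow rule is simple arithmetic. The sketch below assumes ops load and team capacity are both measured in hours for the same period; `split_ops_load` and the numbers are illustrative only.

```python
def split_ops_load(ops_hours: float, sre_capacity_hours: float,
                   cap: float = 0.50) -> tuple[float, float]:
    """Return (hours handled by SRE, hours overflowing to Dev).

    SRE absorbs ops work up to `cap` of its total capacity;
    anything beyond that is handed to the development team.
    """
    sre_limit = cap * sre_capacity_hours
    kept = min(ops_hours, sre_limit)
    return kept, ops_hours - kept


# A team with 200 hours of capacity facing 130 hours of ops work:
print(split_ops_load(ops_hours=130, sre_capacity_hours=200))  # (100.0, 30.0)
```

The 30 overflowing hours land on the developers, which gives them a direct reason to make the service easier to operate.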

Page 33: SRE - drupal day aveiro 2016


Monitoring and Being On-Call

Page 34: SRE - drupal day aveiro 2016

Pager fatigue
A serious problem to be addressed

➔ An engineer can only react with urgency a few times a day before they get fatigued.
➔ Every page should be actionable.
➔ Every page response should require intelligence.
➔ Pages should be about a new problem or an event that hasn’t been seen before.
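These paging rules can be read as a routing decision for each alert. The sketch below is one possible interpretation: the `Alert` fields and the "ticket" and "automate" destinations are assumptions added for illustration, since the slides only say what deserves a page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert description used only for this example."""
    name: str
    actionable: bool      # a human can do something about it right now
    needs_judgment: bool  # the response requires intelligence, not a script
    seen_before: bool     # a known, recurring event


def route(alert: Alert) -> str:
    """Return 'page', 'automate' or 'ticket' for an alert."""
    if not alert.actionable:
        return "ticket"    # nothing to do urgently, so nobody gets woken up
    if not alert.needs_judgment or alert.seen_before:
        return "automate"  # rote or recurring responses should become code
    return "page"          # new, actionable, and needs a human


print(route(Alert("new-latency-spike", True, True, False)))   # page
print(route(Alert("disk-full-again", True, False, True)))     # automate
print(route(Alert("cert-expires-in-60d", False, True, False)))  # ticket
```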

Page 36: SRE - drupal day aveiro 2016

Monitoring Conclusion

A healthy monitoring and alerting pipeline is simple and easy to reason about

Page 37: SRE - drupal day aveiro 2016

Postmortem culture
Learning from failure

Page 38: SRE - drupal day aveiro 2016

What are Postmortems?

➔ Document written for ALL significant incidents
➔ Non-paged incidents are even more valuable, since they reveal monitoring gaps
➔ Explain what happened in detail
➔ Find all root causes of the event
➔ Assign actions to correct the problem or improve how it is addressed next time
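The elements of a postmortem listed above map naturally onto a small record type. This is a sketch only: the class and field names are hypothetical, and a real template would usually carry more detail, such as a timeline.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str  # a team or role responsible for follow-up, never a culprit


@dataclass
class Postmortem:
    """Hypothetical postmortem record mirroring the bullet points above."""
    title: str
    what_happened: str
    paged: bool  # False marks a non-paged incident, i.e. a monitoring gap
    root_causes: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the sections that still need to be filled in."""
        missing = []
        if not self.what_happened.strip():
            missing.append("what_happened")
        if not self.root_causes:
            missing.append("root_causes")
        if not self.action_items:
            missing.append("action_items")
        return missing


draft = Postmortem(
    title="Checkout errors during deploy",
    what_happened="Error rate rose after a config change reached production.",
    paged=False,
    root_causes=["Config change was not covered by the canary stage"],
)
print(draft.missing_sections())  # ['action_items']
```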

Page 39: SRE - drupal day aveiro 2016

Postmortems Are Blameless!

➔ Use a blame-free postmortem culture, with the goal of exposing faults
  ◆ Apply engineering to fix these faults,
  ◆ Don’t just try to avoid or minimize them.

Page 40: SRE - drupal day aveiro 2016


Learn and teach with postmortems

Source: http://www.xkcd.com/1495/

Page 42: SRE - drupal day aveiro 2016

Questions?