and a Chance to Make Things Better Using Documentation ...

38
Daniel “Spoons” Spoonhower, CTO and Co-founder Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better

Transcript of and a Chance to Make Things Better Using Documentation ...

Page 1: and a Chance to Make Things Better Using Documentation ...

Daniel “Spoons” Spoonhower, CTO and Co-founder

Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better

Page 2: and a Chance to Make Things Better Using Documentation ...

Spoons (aka Daniel Spoonhower)CTO and Co-founder, Lightstep

2

@save_spoons

[email protected]

Who am I?

Page 3: and a Chance to Make Things Better Using Documentation ...

Some Background

3

Page 4: and a Chance to Make Things Better Using Documentation ...

4

SRE Dev

SRE Dev

SRE Dev

SRE Dev

SREDev

DevDev Dev

Dev

Page 5: and a Chance to Make Things Better Using Documentation ...

5

CC BYSA Kharnagy

Page 6: and a Chance to Make Things Better Using Documentation ...

6

Dev

DevDev Dev

DevSRE

How do we create a map between our software architecture and our organization?

Page 7: and a Chance to Make Things Better Using Documentation ...

Service ownership, defined

Dev teams are responsible for delivery of software and service

Includes activities such as:- Writing code 😁- Fixing bugs 😞- Incident response- Cost management- Capacity planning- …

7

Increased independence ==> Higher developer velocity

Autonomous decisions ==> Better outcomes

Page 8: and a Chance to Make Things Better Using Documentation ...

Obstacles to successful service ownership

You can’t have independence without clear responsibilities and goals.

You can’t scale autonomy without consistent ways of measuring and reporting on progress and outcomes.

You will meet (strong) resistance to changes ownership unless you also give teams the agency to change things.

8

Page 9: and a Chance to Make Things Better Using Documentation ...

9

Ownership means Accountability and Agency

Effective ownership requires Distributed Tracing

Importance of Documentation, Oncall, and SLOs

Page 10: and a Chance to Make Things Better Using Documentation ...

Distributed Tracing

10

Page 11: and a Chance to Make Things Better Using Documentation ...
Page 12: and a Chance to Make Things Better Using Documentation ...

Traces are a form of telemetry based on spans with structure- Span = timed event describing work done by a single service

Tracing is a diagnostic tool that reveals…

… how a set of services coordinate to handle individual user requests … from mobile or browser to backends to databases (end-to-end) … including metadata like events (logs) and annotations (tags)

Provides a request-centric view of application performance

Distributed tracing, defined

12

Page 13: and a Chance to Make Things Better Using Documentation ...

Relationships matter

13

Traces encode causal relationships between callers and callees

callsreturns

Page 14: and a Chance to Make Things Better Using Documentation ...

Traces = raw material, not finished product

Distributed traces – basically just structs

Distributed tracing – the art and science of deriving value from traces

14

Page 15: and a Chance to Make Things Better Using Documentation ...

Building Service Ownership

15

Page 16: and a Chance to Make Things Better Using Documentation ...

16

DocsOncall

SLOs

Page 17: and a Chance to Make Things Better Using Documentation ...

Start with expertise... then ownership

Make it easy to find related:

- Telemetry and dashboards- Alert definitions- Playbooks

Use a template!

Track last-modified dates

- Require periodic audits & updates

Centralized documentation

17

Page 18: and a Chance to Make Things Better Using Documentation ...

Make documentation machine-readable

Use it to generate:

- Dashboard config- Escalation policy config- Deployment pipeline config- …

Make updating documentation necessary for day-to-day work

Centralized documentation

18

& Machine-readable

Page 19: and a Chance to Make Things Better Using Documentation ...

It’s hard to keep service dependencies up to date manually…

So don’t!

Use telemetry from the application

- Traces, in aggregate, reveal service dependencies

- Service levels show current reliability

Centralized documentation

19

& Machine-readable

& Dynamic!

Page 20: and a Chance to Make Things Better Using Documentation ...

Record who is accountable

Automate many mundane tasks

Train new team members

Build confidence

Why is documentation important?

20

Page 21: and a Chance to Make Things Better Using Documentation ...

Oncall is (often) responsible for…

- Incident response- Communicating status internally &

externally- Production change management

- Deploying new code - Pushing infrastructure changes

- Monitoring dashboards- Low-urgency alert triage- Customer requests

- And other interrupt driven work- Shift handoffs- Writing postmortems

Oncall

21

Photo by VT98Fan and Starwhooper

Page 22: and a Chance to Make Things Better Using Documentation ...

Iterating toward ownership

Establish a need to split- Survey expertise & happiness- Look at response time, number

of people per incident

Some shock absorbers:- Experts on the rotation- Good documentation - Balance between rotation size

and number of services

22

Page 23: and a Chance to Make Things Better Using Documentation ...

How to improve incident response:

- Reduce response times- Deliver alerts to the right teams

Handling alerts

23

Page 24: and a Chance to Make Things Better Using Documentation ...

More context ⟶ mitigating faster

24

Page 25: and a Chance to Make Things Better Using Documentation ...

Send alerts directly to the teams that are responsible for taking action!

“Are We All on the Same Page?”Luis Mineiro @SREcon19 EMEA

Dynamic alert delivery

25

Page 26: and a Chance to Make Things Better Using Documentation ...

How to improve incident response:

- Reduce response times- Deliver alerts to the right teams

Handling alerts

26

- Delete unnecessary alerts- Adjust rotation schedule to better fit

team and sprint structure

Photo by Steve Johnson on Unsplash

Page 27: and a Chance to Make Things Better Using Documentation ...

“How will we do better next time?”

- Ensure underlying issues are fixed- Improve responses for novel issues

Improving postmortems

27

Photo by Tamara Gak on Unsplash

Page 28: and a Chance to Make Things Better Using Documentation ...

Postmortems are documentation

Make it easy!Make sure they are centralizedMake sure action items and other info are captured in a structured wayAnd of course) leverage telemetry

28

Page 29: and a Chance to Make Things Better Using Documentation ...

Direct impact on customer experience (revenue, reputation, etc.)

Time spent handing pages, writing postmortems, handling interrupts is... time not spent building new features, proactive optimization

Stress of oncall has major impact on job satisfaction

Why is improving oncall important?

29

Page 30: and a Chance to Make Things Better Using Documentation ...

Service Level Objectives

- Customer expectations for the service your provide

- Both external and internal customers- With a threshold that lets you report

success/failure over time (or groups of events)

- As a means of communicating how reliable the service is

SLOs

30

p99 latency < 5s over the last 5 minutes

threshold

service level indicator (SLI measurement window

Aim for less-than-perfect

- three nines, 99.5%, 98.2%, ...

Common indicators

- Latency (p50, p99, etc.)- Error rate

But specific to the endpoints, operations, and flows of your application

Page 31: and a Chance to Make Things Better Using Documentation ...

Ask:

- What do your customers expect?- What can you provide today?- How do you expect that to change?

Determining SLOs

31

Page 32: and a Chance to Make Things Better Using Documentation ...

Derive internal SLOs using tracing

32

A

B

C

A p99 latency < 5s

B p99 latency < 5s

C p99 latency < 2.5s

D

~p99.5 latency < 2.5s

Page 33: and a Chance to Make Things Better Using Documentation ...

They measure success in delivering service

Teams use them as a guide to prioritize work

Consistency and transparency across your org- Hold teams accountable in a uniform way

Why are SLOs important?

33

Page 34: and a Chance to Make Things Better Using Documentation ...

3-piece puzzle review

Documentation- Establishes ownership: who you will hold accountable- Don’t try to manually manage dependency lists, etc.

Oncall- More than just incident response- Use docs and telemetry as part of investigation, automation,

communicationSLOs- How you communicate and measure success!- Define objectives for internal services using tracing

34

DocsOncall

SLOs

Page 35: and a Chance to Make Things Better Using Documentation ...

Next steps…

35

Page 36: and a Chance to Make Things Better Using Documentation ...

Making changes

Rolling out new processes/tools with many teams is hard1. Process/tools must provide value to dev teams2. Ideally, they are necessary parts of their day-to-day work

To establish and maintain service ownership- Use a combination of docs, oncall process, and SLOs- Manufacture a need for those process/tools where necessary- Give teams a budget for improving docs, alerts, and reliability

36

Page 37: and a Chance to Make Things Better Using Documentation ...

Ownership = Accountability + Agency

Accountability- Set deliverables and goals for service owners- Judge their performance based on those deliverables and goals

Agency- Offer the information, confidence, and budget to improve

37

Telemetry provides key information to drive both!

Page 38: and a Chance to Make Things Better Using Documentation ...

Daniel “Spoons” Spoonhower, CTO and Co-founder

Thank you

@save_spoons