The SRE I aspire to be Yaniv Aknin // @aknin #SREcon Dublin 2019


Page 1:

The SRE I aspire to be

Yaniv Aknin // @aknin

#SREcon Dublin 2019

Page 3:

Who is this guy

● Google SRE since 2013
  Most recently GCP's Quantitative Reliability Lead

● Jack of all trades
  Equal parts SRE, dev, and /pro(duct|ject) manager/

● Opinions my own
  But I owe a lot here to others

NB: what does "SRE" really mean?

Page 5:

Wikipedia says Engineering is "using scientific principles to design and build $THINGS"

Imagine THINGS="Reliability"... how do we apply science to that?

https://en.wikipedia.org/wiki/Engineering

Page 6:

“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, your knowledge is of a meagre and unsatisfactory kind.”

William Thomson (Lord Kelvin), President of the Royal Society
Lecture on "Electrical Units of Measurement"
Published in "Popular Lectures", Vol. 1, 1883 (abridged to fit slide)

Page 7:

Measurably optimise reliability vs cost

Page 11:

On ops, user harm, and tradeoffs

[Figure labels: Ops; Reliability; "Your product is here."; Engineering]

Page 13:

Engineering reliability

[Figure: HDD]

A single HDD has an annualized failure rate (AFR) of ~1.5%

How can we build a more reliable logical disk?

[Figure: HDD, HDD]

RAID-1 (mirror) should theoretically offer 1.5%² (~0.02%) AFR

Assuming immediate disk replacement+replication after failure, completely independent disk failures, and no RAID related bugs. None of which is even remotely true, of course.
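The mirror figure can be checked with a few lines of arithmetic. This is only a sketch: the function name mirrored_afr is made up for illustration, and it bakes in the slide's idealised assumptions (fully independent failures, immediate replacement, no RAID bugs), which the slide itself says do not hold in practice.

```python
def mirrored_afr(single_afr: float, copies: int = 2) -> float:
    """Annualized failure rate of a mirror of `copies` disks.

    Assumes every copy fails independently and a failed copy is replaced
    instantly, so data is lost only if all copies fail in the same year.
    """
    if not 0.0 <= single_afr <= 1.0:
        raise ValueError("single_afr must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return single_afr ** copies


single = 0.015                   # ~1.5% AFR for one HDD, as on the slide
mirror = mirrored_afr(single)    # 1.5% squared
print(f"single disk: {single:.2%}")
print(f"RAID-1 pair: {mirror:.4%}")  # 0.0225%, i.e. the slide's ~0.02%
```

The real-world number is worse because correlated failures (same batch, same rack, same firmware) and rebuild windows break the independence assumption.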

Page 15:

The (modest) reliability engineer toolbox

● Redundant resource
  Trade cost

● Retry transient failures
  Trade latency

● Degraded results
  Trade quality

● Any of these
  Adds complexity
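The three tools and their tradeoffs can be sketched in one small helper. Everything here is hypothetical and for illustration only: the TransientError class, the fetch function, and its parameters are not from the talk. Note how the extra branches are exactly the "adds complexity" cost the slide warns about.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def fetch(
    replicas: Sequence[Callable[[], T]],
    degraded: Optional[Callable[[], T]] = None,
    attempts: int = 2,
    delay_s: float = 0.05,
) -> T:
    """Try every replica, retry the whole set, then fall back to a degraded answer."""
    if not replicas:
        raise ValueError("need at least one replica")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[TransientError] = None
    for attempt in range(attempts):      # retry: trades latency
        for replica in replicas:         # redundant resource: trades cost
            try:
                return replica()
            except TransientError as err:
                last_error = err
        if attempt < attempts - 1:
            time.sleep(delay_s)          # wait before the next round
    if degraded is not None:             # degraded result: trades quality
        return degraded()
    assert last_error is not None        # at least one call failed to get here
    raise last_error
```

A caller might pass two backend calls as replicas and a cached or default value as the degraded fallback, for example fetch([read_primary, read_secondary], degraded=read_cache), where those three callables are placeholders.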

Page 16:

Compound/advanced reliability patterns

● Partitioning
● Sidecar
● Fail Static
● Self-healing
● Waterfall
● Jitter
● Breaker
● Infra as Code
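Two of the patterns named on this slide are small enough to sketch. The code below reads "Jitter" as randomised retry backoff and "Breaker" as a circuit breaker; that reading, and every name and default value in the code, is an assumption made for illustration rather than something the slide spells out.

```python
import random
import time


def backoff_with_jitter(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Full jitter: pick a random delay in [0, min(cap_s, base_s * 2**attempt)].

    Randomising the delay keeps many clients from retrying in lockstep.
    """
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then probe it again later."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """Return True if a call may go through right now."""
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, let a probe call through (half-open).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open (or re-open) the breaker
```

In use, a caller would check allow() before each request, sleep for backoff_with_jitter(attempt) between retries, and report the outcome through record_success() or record_failure().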

Page 17:

Innovation (engineering, proactive, change)

Reliability (support, reactive, preserve)

Page 18:

Reliability (support, reactive, preserve)

Innovation (engineering, proactive, change)

?

Page 19:

Reliability (engineering, proactive, change)

Innovation (engineering, proactive, change)

The Error Budget
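As a worked example, an error budget is commonly computed as the share of events an SLO allows to fail. The sketch below assumes a request-based measurement and a 99.9% target; the function name, parameters, and numbers are hypothetical and not taken from the talk.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures, fraction_consumed).

    slo_target is a fraction such as 0.999; the budget is whatever the
    target leaves over, i.e. (1 - slo_target) of all requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    consumed = failed_requests / allowed if allowed else float("inf")
    return allowed, remaining, consumed


allowed, remaining, consumed = error_budget(0.999, 1_000_000, 400)
print(f"allowed failures:   {allowed:.0f}")    # about 1000
print(f"remaining failures: {remaining:.0f}")  # about 600
print(f"budget consumed:    {consumed:.0%}")   # about 40%
```

While budget remains, it can be spent on change (launches, experiments); once it is consumed, the same number argues for prioritising reliability work, which is how one measurement serves both sides.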

Page 20:

MTBF/MTTR

Challenge: fungible definition of "failure"

"9s" (e.g. "99.95% uptime")

Challenge: aggregating individual events into business credible 9s

[Figure labels: MTBF; MTTR; 99.99%; 99.9%]
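The two vocabularies on this slide are linked by the standard steady-state formula availability = MTBF / (MTBF + MTTR). The sketch below uses that formula with made-up numbers; the function names and the 365-day year are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def availability(mtbf_h, mttr_h):
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_h <= 0 or mttr_h < 0:
        raise ValueError("mtbf_h must be positive and mttr_h non-negative")
    return mtbf_h / (mtbf_h + mttr_h)


def downtime_per_year_h(avail):
    """Hours of downtime per year implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Very different failure patterns can produce the same "9s":
rare_long = availability(mtbf_h=999.0, mttr_h=1.0)     # one 1-hour outage per ~1000 h
frequent_short = availability(mtbf_h=99.9, mttr_h=0.1)  # one 6-minute outage per ~100 h
print(f"{rare_long:.3%} vs {frequent_short:.3%}")

print(f"99.9%  allows {downtime_per_year_h(0.999):.2f} h/year")         # about 8.76 hours
print(f"99.99% allows {downtime_per_year_h(0.9999) * 60:.1f} min/year")  # about 52.6 minutes
```

Both example systems report roughly 99.9%, yet users experience them differently, which is one way to see the slide's point about aggregating individual events into credible 9s.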

Page 21:

You need "better quality" 9s!

● "Whatever I happened to ship"
● "I spent time making my metrics hit certain thresholds"
● "Whatever I happened to measure"
● "I spent time ensuring 9s correlate with customer pain"

Page 22:

● "Whatever I happened to ship"
● "I spent time making my metrics hit certain thresholds"
● "Whatever I happened to measure"
● "I spent time ensuring 9s correlate with customer pain"

[Figure labels: Happy Customers; Wasted Effort; Unknown Problem; Known Problem]

First move right, then move up

Page 23:

Measurably optimise reliability vs cost

[Figure labels: Ops; Reliability; "Your product is here."; Engineering; Revenue; Cost]

Page 25:

Why is this hard?

● Scope
● Difficulty
● Cost++
● Misconceptions

And why is it good?

● Leverage
● Cost--
● Precision

Page 28:

SRE team: a recipe

Fundamental
● Monitoring
● Alerting
● Capacity planning
● CI/CD & Rollouts
● Load Balancing

Advanced
● System Architecture
● Distributed Algorithms
● Networking
● Operating Systems

Pioneering
● Product Management
● Data Science
● Business Acumen
● UX Research

Page 30:

The SRE I aspire to be

● Have a measurement of reliability
● The measurement is tied to project priorities
● Your ops work is tied to the measurement

Please remember this is my aspiration... tell me yours?

Page 31:

Thank you!

Art credits

● "Lord Kelvin", Messrs. Dickinson, London, goo.gl/RHF61Z, [cropped]
● "Complex looking chart", MIT SERG, http://strategic.mit.edu, [recoloured]
● "Gears" and "Robot", Google AutoDraw, CC4, [recoloured and adapted]
● Yin Yang, https://en.wikipedia.org/wiki/File:Yin_yang.svg [recoloured]

Yaniv Aknin // @aknin