The Service Score Gamifying Operational …...Gamifying Operational Excellence Basically, if we can...

Post on 06-Jul-2020


Gamifying Operational Excellence

The Service Score Card

Agenda

1 The Problem

2 A Solution idea

3 A Solution tour

4 The results

5 Take aways & lessons Learnt & Questions

About me: Danny ☃ Lawrence

“If it's not broken, I’ll fix it.”

From Australia, on loan as Staff SRE @ LinkedIn

jobs, companies, recruiter

& Finder of encoding bugs

Good news SRECON.

You passed the ☃ test.


Some terms (before we really get started)

Operational Excellence: effective and efficient delivery of information, technology, and services required by end users that add measurable value.

Operational Excellence: Doing everything required to make sure all of your services are as fast and as reliable as possible.

Gamification: application of game-design elements and game principles in non-game contexts.

Some background (LinkedIn SRE crash course)

Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs

The Problem (What started this whole thing)

Problem 1: The GOOD & The BAD

BAD services wake me up

GOOD services let me sleep

What makes a GOOD service at LinkedIn is a moving target.

Technologies and dependencies change over time.

Some examples

Upgrading dependencies & libraries: Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrate from SVN to Git
Reduce application startup time
Set up error budgeting
True up the number of metrics

A GOOD service can turn into a BAD service if you are not checking it.

Unfortunately, BAD services do not magically turn into GOOD services.

Problem 2: Knowing what is BAD

Problem 3: Knowing why it’s BAD

Problem 4: Tribal knowledge about how to get to GOOD

The only thing we appear to hate more than not having documentation...

...is writing documentation.

The Problem summary

BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD

The Service Scorecard (A solution)

In order to determine the health of the services we support, we define a list of production requirements.

Apply a weight to each requirement.

Codify each requirement into a check.

Execute these checks for each service.

Tally up the results for each service.

Grade the service from “F” to “A+”.
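The steps above (weighted requirements, coded checks, a tally, a letter grade) can be sketched roughly as follows. This is only an illustration: the talk does not give the scoring formula or the grade boundaries, so the weighted-average formula, the `GRADE_BANDS` thresholds, and every name here are assumptions.

```python
from typing import List, Tuple, Union

# (weight, state) pairs. The state follows the convention from the talk:
# True / False / None, or a float between 0.0 and 1.0.
State = Union[bool, float, None]
Result = Tuple[int, State]

# Hypothetical grade boundaries in percent; the talk only says "F" to "A+".
GRADE_BANDS = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]


def tally(results: List[Result]) -> float:
    """Return a weighted percentage score, skipping checks whose state is None."""
    earned = 0.0
    possible = 0.0
    for weight, state in results:
        if state is None:  # assumed here to mean "not applicable"
            continue
        earned += weight * float(state)  # True -> 1.0, False -> 0.0
        possible += weight
    return 100.0 * earned / possible if possible else 0.0


def grade(score: float) -> str:
    """Map a percentage score to a letter using the assumed bands."""
    for floor, letter in GRADE_BANDS:
        if score >= floor:
            return letter
    return "F"
```

With these assumed bands, a service passing a weight-5 check, half-passing another weight-5 check, and skipping a third would score 75% and land on a "C".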

Add all the services into a high-score system.

Then publish those scores to the company.
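A high-score table is, at its simplest, a sort over the per-service scores. A minimal sketch, using made-up service names and scores and a hypothetical `leaderboard` helper:

```python
from typing import Dict, List, Tuple


def leaderboard(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Rank services from highest to lowest score; ties are ordered by name."""
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


# Illustrative data only; these are not real scores.
example = {"jobs-server": 82.0, "companies-server": 95.5, "recruiter-server": 82.0}
for rank, name, score in leaderboard(example):
    print(f"{rank:>2}  {name:<20} {score:5.1f}")
```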

This is great, but how do I improve the score?

How can I add X check into the system?

What makes a check?

Checks are one type of plugin.

Fetch plugins gather data.
Check plugins check the data.

We use the fetch plugin to gather remote data from: SVN, Git, configuration DBs, host databases, monitoring systems, build systems, deployment systems.

Basically, if we can fetch it, then we do so.

We build a giant context object.

The check plugin will look at our context object.
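One way to picture the fetch-then-check flow: run every fetch plugin, file each plugin's data under a key in one context object, then hand that object to the checks. The sketch below is an assumption about the mechanics, not the actual implementation; `build_context`, the stand-in plugin, and the use of `SimpleNamespace` for attribute access are all illustrative.

```python
from types import SimpleNamespace
from typing import Callable, Dict


def build_context(service_name: str, fetchers: Dict[str, Callable]) -> SimpleNamespace:
    """Run each fetch plugin and store its data under the plugin's key."""
    ctx = SimpleNamespace(service_name=service_name)
    for key, fetch in fetchers.items():
        state, message, data = fetch(service_name)
        # Top-level keys become attributes; nested dicts stay plain dicts here.
        setattr(ctx, key, SimpleNamespace(**(data or {})))
    return ctx


# Stand-in fetch plugin returning canned data instead of calling a remote system.
def fake_fetch_ownership(service_name):
    return True, "gathered data", {"eng_team": "jobs-team"}


ctx = build_context("jobs-server", {"ownership": fake_fetch_ownership})
print(ctx.ownership.eng_team)
```

A check plugin can then read `ctx.ownership.eng_team` without knowing where the data came from.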

All plugins are small Python scripts, where small is 10~30 LOC.

Simply return 2 or 3 things.

state*: True, False, None or 0.0 - 1.0
message*: short string
data: Python dict of interesting things.
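The return convention (two required fields, one optional) could be normalised by the framework along these lines. A sketch only: `PluginResult` and `normalise` are hypothetical names, and converting True/False to 1.0/0.0 while leaving None untouched is an assumption about how the states are interpreted.

```python
from typing import Any, Dict, NamedTuple, Optional


class PluginResult(NamedTuple):
    state: Optional[float]  # None, or a value in 0.0 - 1.0
    message: str
    data: Dict[str, Any]


def normalise(returned: tuple) -> PluginResult:
    """Accept (state, message) or (state, message, data) and validate it."""
    if len(returned) not in (2, 3):
        raise ValueError("plugins return 2 or 3 things")
    state, message = returned[0], returned[1]
    data = returned[2] if len(returned) == 3 else {}
    if state is not None:
        state = float(state)  # True -> 1.0, False -> 0.0
        if not 0.0 <= state <= 1.0:
            raise ValueError("state must be between 0.0 and 1.0")
    return PluginResult(state, str(message), dict(data))
```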

Example fetch plugin

    # `r` is presumably an HTTP client such as requests (import requests as r)
    @ssc.tags("ownership")
    def fetch_ownership(service_name):
        "Fetch all the ownership data of a service"
        o = r.get("http://owners/" + service_name)
        return True, "gathered data", o.json()


Example check plugin

    @ssc.weight(5)
    @ssc.tags('ownership')
    @ssc.wiki('http://wiki/ssc_eng_owner')
    def check_eng_team(ctx):
        "ensure ENG ownership of a service"
        if ctx.ownership.eng_team:
            return True, ctx.ownership.eng_team
        return False, "missing eng_team"

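The talk does not show how `ssc.weight`, `ssc.tags`, and `ssc.wiki` work. One plausible reading is that they simply attach metadata to the plugin function for the framework to read later. The stand-in below is an assumption written for illustration, not LinkedIn's code; the `ssc_*` attribute names are invented.

```python
from types import SimpleNamespace


def _attach(name, value):
    """Build a decorator that stores `value` on the function as `name`."""
    def decorator(func):
        setattr(func, name, value)
        return func
    return decorator


# Hypothetical stand-in for the `ssc` module used on the slides.
ssc = SimpleNamespace(
    weight=lambda n: _attach("ssc_weight", n),
    tags=lambda *names: _attach("ssc_tags", set(names)),
    wiki=lambda url: _attach("ssc_wiki", url),
)


@ssc.weight(5)
@ssc.tags("ownership")
@ssc.wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


# Canned context standing in for the fetched data.
ctx = SimpleNamespace(ownership=SimpleNamespace(eng_team="jobs-team"))
print(check_eng_team.ssc_weight, check_eng_team.ssc_tags, check_eng_team(ctx))
```

Keeping the weight, tags, and wiki link next to the check itself means a failing check can point straight at its own remediation page.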

Putting it all together

Problems

Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD

Understanding what is BAD

[Service Scorecard slides]

Knowing why it is BAD

[Service Scorecard slides]

Not sure how to fix the BAD

What is the check?
Why is it important?
How long will it take to fix?
How will it be fixed?

AngularJS
(image: CC BY 4.0 https://angular.io/presskit.html (2017))

{{service_name}} becomes jobs-server

{{context.ownership.eng_owner}} becomes jobs-team

Using our fetched data in the wiki

{{service_name}}

{html}

<script src="https://cdn/angularjs.js"></script>

{html}

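    // Read the service name from the page's query string
    var query = $location.search();
    var service_name = query['service_name'];

    // Fetch that service's context from the scorecard API
    var url = 'http://ssc/api/' + service_name;
    $http.get(url).success(
      function(ctx) {
        $scope.ctx = ctx;
      }
    );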



{{ctx.ownership.owner_eng}}

{{ctx.number_of_hosts}}

{{ctx.product.lib.jetty.version}}

{{ctx.hosts.hostnames}}

{{ctx.is_deployed_in_prod}}

{{ctx.commits.last_commit}}
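AngularJS does this substitution in the browser. To make the idea concrete, here is the same dotted-path lookup sketched in Python; the `lookup` and `render` helpers and the sample context values are invented for illustration.

```python
import re
from typing import Any, Dict

# Matches {{ dotted.path }} with optional surrounding whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(scope: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'ctx.ownership.owner_eng' through nested dicts."""
    value: Any = scope
    for part in path.split("."):
        value = value[part]
    return value


def render(template: str, scope: Dict[str, Any]) -> str:
    """Replace each {{ dotted.path }} with its value from scope."""
    return PLACEHOLDER.sub(lambda m: str(lookup(scope, m.group(1))), template)


# Sample values only; the real context comes from the fetch plugins.
scope = {"ctx": {"ownership": {"owner_eng": "jobs-team"}, "number_of_hosts": 12}}
print(render("Owned by {{ctx.ownership.owner_eng}} on {{ctx.number_of_hosts}} hosts", scope))
```

The point is that one wiki page can serve every service: the text stays the same and only the fetched values change.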

Problems

Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD

Now

Reports show what is BAD
Checks validate why it is BAD
Wiki shows how to fix the BAD

No more of these emails

“If you use a lib-core, then upgrade it, we found a bug”

How many of my 80 services use this lib?
How do I check?
How do I upgrade?
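With every service's context already fetched, those questions become a query. A rough sketch, assuming each context exposes library versions under a `lib` mapping (loosely modelled on `ctx.product.lib.jetty.version` from the wiki examples); the data, the `services_using` helper, and the simple numeric version comparison are all illustrative.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '9.4.12' into (9, 4, 12) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def services_using(contexts: Dict[str, dict], lib: str, below: str) -> List[str]:
    """List services whose version of `lib` is older than `below`."""
    hits = []
    for service, ctx in contexts.items():
        version = ctx.get("lib", {}).get(lib)
        if version is not None and parse_version(version) < parse_version(below):
            hits.append(service)
    return sorted(hits)


# Made-up contexts for three services.
contexts = {
    "jobs-server": {"lib": {"lib-core": "1.2.0"}},
    "companies-server": {"lib": {"lib-core": "1.4.1"}},
    "recruiter-server": {"lib": {}},
}
print(services_using(contexts, "lib-core", below="1.4.0"))
```

Wrapped in a check plugin, the same comparison would lower the score of every affected service, which replaces the broadcast email.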

Where does this tool fit?

pre-commit Build Deployment Monitoring

Service Scorecard

API

hack-days Reporting Deployment Monitoring

Results & Outcomes

What do we do with the scores?

Priority #1: Getting the grades better

                                When we started    Now
Average grade for my team       40%                80%
Average score across SRE        35%                60%
Checks in 24 hours              15,560             89,859
Number of checks per service    15                 31

We can now explore new ways to use the scores

Carrot & Stick

Carrot / GOOD

Stick / BAD

No SRE support for F Grade services.

F Grade services generally cause the most problems.

No deploy moratorium for A+ services

A+ services generally cause the least problems.

A services are allowed to deploy 24/7

Premium SRE support for A+ services

Priority build queues for GOOD services.

Tiger teams to raise the scores on F Grade services

Hack Days

FREE BEER

Basically any problem can be solved with FREE BEER

OR T-Shirts

Influence where we allocate open headcount

Simple way to get things done

Take aways & Lessons Learnt

Everyone cares about Reliability.

Everyone cares about Reliability, everyone is a Site Reliability Engineer.

Everyone cares about Reliability, you just need to empower them.

Hack Days are important: this POC was built in an afternoon.

Getting the data was easy; finding interesting ways to use it is hard.

Make it as easy as possible to do the right thing.

Cheers!

Q & A