All Day DevOps 2016 - Turning Human Capital into High Performance Organizational Capital

Post on 16-Apr-2017

Devops: Turning Human Capital into High Performance Organizational Capital

John Willis @botchagalupe

About Me

• One of the founding members of “Devopsdays”
• Co-author of the “Devops Handbook”
• Author of the “Introduction to Devops” on Linux Foundation edX
• Podcaster at devopscafe.org
• Devops Enterprise Summit - Cofounder
• Ninth person in at Chef (VP of Customer Enablement)
• Formerly Director of Devops at Dell
• Founder of Socketplane (Acquired by Docker)
• 10 Startups over 25 years

https://github.com/botchagalupe/my-presentations

How would I describe Devops to a CEO?

How would you describe Devops to a CEO?

The consequences of failure have never been greater…

Wanna know how?

Devops Practices and Patterns

• Continuous Delivery
  • Everything in version control
  • Small batch principle
  • Trunk based deployments
  • Manage flow (WIP)
  • Automate everything

• Culture
  • Everyone is responsible
  • Done means released
  • Stop the line when it breaks
  • Remove silos

itrevolution.com/devops-handbook

Human Capital and High Performance Organizations

Recent IT Performance Data is Compelling

High performers compared to their peers…

Faster
• 30x more frequent deployments
• 200x faster lead times

Higher Quality
• 60x the change success rate
• 168x faster mean time to recover (MTTR)

More Effective
• 2x more likely to exceed profitability, market share & productivity goals
• 50% higher market capitalization growth over 3 years*

Data from 2014/2015 State of DevOps Report - https://puppetlabs.com/2015-devops-report

2555x

Conventional Wisdom

Fast, Cheap, Good

“Pick Two!”

Faster, Better, and Cheaper?

Organizational culture was one of the strongest predictors of both IT performance and the overall performance of the organization

Devops is about Humans

Devops is a set of practices and patterns that turn human capital into high performance organizational capital.

Google

• Over 15,000 engineers in over 40 offices
• 4,000+ projects under active development
• 5500+ code submissions per day (20+ p/m)
• Over 75M test cases run daily
• 50% of code changes monthly
• Single source tree

Amazon

• 11.6 second mean time between deploys
• 1079 max deploys in a single hour
• 10,000 mean number of hosts simultaneously receiving a deploy
• 30,000 max number of hosts simultaneously receiving a deploy
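
To put the deploy interval above in more familiar terms, the mean gap between deploys can be converted into an hourly and daily rate. This is only a back-of-the-envelope sketch: it assumes the 11.6 second mean holds around the clock, which the slide does not state, and the function name is illustrative.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def deploys_per_period(mean_seconds_between_deploys: float, period_seconds: int) -> float:
    """Average number of deploys in a period, given the mean gap between deploys."""
    if mean_seconds_between_deploys <= 0:
        raise ValueError("mean gap must be positive")
    return period_seconds / mean_seconds_between_deploys


mean_gap = 11.6        # seconds between deploys, from the slide
peak_per_hour = 1079   # max deploys in a single hour, from the slide

per_hour = deploys_per_period(mean_gap, SECONDS_PER_HOUR)
per_day = deploys_per_period(mean_gap, SECONDS_PER_DAY)
peak_ratio = peak_per_hour / per_hour  # how far the peak hour exceeds the average hour

print(f"average: {per_hour:.0f} deploys per hour, {per_day:.0f} per day")
print(f"peak hour is about {peak_ratio:.1f}x the average hour")
```

Under that assumption the mean works out to roughly 310 deploys per hour and about 7,448 per day, with the peak hour running at about 3.5 times the average.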

Unicorns and Horses (Enterprises)

Unicorns

Enterprise

Shamelessly stolen and repurposed from: Pete Cheslock

Enterprise Organizations

• Ticketmaster - 98% reduction in MTTR
• Nordstrom - 20% shorter Lead Time
• Target - Full Stack Deploy 3 months to minutes
• USAA - Release from 28 days to 7 days
• ING - 500 application teams doing devops
• CSG - From 200 incidents per release to 18
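
Some of the results above are given as before/after pairs rather than percentages. A minimal sketch of the conversion, so they can be read alongside the Ticketmaster and Nordstrom figures; the helper name is illustrative and the inputs are the numbers from the slide.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


usaa = percent_reduction(28, 7)    # release cycle, in days
csg = percent_reduction(200, 18)   # incidents per release

print(f"USAA release cycle: {usaa:.0f}% shorter")
print(f"CSG incidents per release: {csg:.0f}% fewer")
```

That gives a 75% shorter release cycle for USAA and 91% fewer incidents per release for CSG.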

Faster, Better, and Cheaper. How?

• Lean
• Safety Culture
• Learning Organization

Lean

Parts Unlimited - "Major Release 6" (Early 2014)

[Value stream map. The diagram layout is not recoverable; the labels it carried are grouped below.]

Phases: Initiation and Planning; Steering; Design; Server Requirements Gathering; Server Approval and Assignment; Provisioning; Dev Breakdown; Dev / Test; Staging Release; Production Release

Step labels: Project Initiation; Approve Project; Monthly Steering Meeting; Provides Input; Create Work Breakdown; Create Requirements (Project Meeting); Create Design; Receive Request for Servers; Create Server Request Spreadsheet; Route for Approval; Approved Into Ops Delivery Queue; "Heads up"; Assign to Delivery Engineer; Clarify or Confirm Req with Dev or QA; Provision Server and Rework; DBA Validation; App/Web Validation; Restore Data; ARB Queue; Detailed Analysis and Requirements; Maybe Track Ticket Dependencies; Assign Requirements, add more detail for their teams; Development Sprint (2 week c/t); Acquire / Prepare needed data; Service Data Setup (Mainframe); Development Deploy to Integration; Integration & Regression Testing focused on service; Sprint Review; Release to Prod; Create CAB ticket; Push Deployment to Stage; Email Notification; Build VMs; End to end testing in Prod; Go-No Go decision meeting; By Cluster; "Remove Feature Flag" (if new arch); Data Setup; Integration Testing; Create Change Tickets > 100; Assign Dev Team; Ops Intake Meeting; Create Ops Tickets

People and groups: ZRA (finance); Portfolio; C-level; Steering Committee; PM; Stakeholders (Tech and Biz); Tech Leads; Architects; Vendor Arch; Ops Arch; Delivery Manager "Matt"; Delivery Engineer; App Team; Dev Leads; Team Leads and PMs; Architecture Review Board ("Bill" plus Architects); Working Group (Devs, PM, Engr, QA; Ops? sometimes); Ops DBA; Test Data Configuration Manager "Jennifer"; Dev, QA; Scrum Dev/QA; Product Owners (using own criteria); Scrum Team or Ops Team (if legacy); QA Lead, PMs, QAs; Team Leads; Ops; Dev Arch; "Linda" (Ops PM); Dev Leadership; Group CIOs and Arch Leads; QA

Tools and artifacts: ServiceNow; Project Charter (high-level stories, project info, description, budget, schedule); Work Breakdown in MS Project (high-level milestones, resource planning); MS Office; SharePoint; detailed requirements for new features; technology refreshes; ERD (Infra req), DRD (Dev req), BRD (Biz req); Tech Req; high-level server tickets; Server Req; Tix; budget and appropriate resources; DB, App or Web; Jira "Stories"; Confluence Pages; Existing Dev Environments; Integ03; Test Link; Stage; Jira; New Arch; Legacy; Prod Env; PrdDB; Compute, Net, Facility, Cabling, Storage

Timings marked on the map: 3 months; 3 months; Hold / Pause; 3 months; 1 week; 1 week; 1 - 6 weeks; 1 week; 4 weeks; 2 week c/t; 16 weeks; 6 weeks H/C: 6; 3 weeks H/C: 8; 4 weeks H/C: 8; 3 weeks H/C: 14; 1 week

Callouts: RESET DELIVERY DATE!; Fix Tickets!

Marker codes: TS, PD

Problems noted on the map:

• Gaps in Requirements
  • Licenses
  • Dependencies on 3rd party apps
  • Capacity planning always seems low ("robbing Peter to pay Paul")
  • Don't purchase in advance even though we know it's coming
• Duplicate info across different documents
• Procurement of physical servers can take months (lead times for procurement plus facilities groups)
• Too many Env. in one ticket causes audit confusion
• Piecemeal requests ("2 this week, 3 next week")
• 1 queue for delivery team with ~1,000 tickets at once
• Capacity issues cause delay
• Often told to stop everything and do something else
• No monitoring or backup for some environments
• 30% of delivery team's time spent "consulting" on performance and dealing with unfounded requests for more capacity
• 3-5 days to fix; ~10% S/R
• Often skips CAB. What CAB reviews is often not what is built
• All manual setup. 1 person really knows how. Low data quality.
• Manual process with lots of back and forth.
• Many tickets with mismatched priorities
• Mostly manual testing
• Manual, per cluster
• Frequently down. External service updates take offline. Lots of contention.
• Sometimes submits server requests directly to delivery
• Ad-hoc requests get lost, maybe 2-3 week delays
• 9+ months of planning before implementation starts (and information / requirements still incorrect or incomplete!)
• Dev and QA told to submit server request 6-8 weeks in advance (only done 50% of time)

S/R values marked on the map: 90%, 55%, 15%, 20%, 50%, 75%

Marker codes scattered across the map (legend not recoverable): TS, PD, EP, D, M, W, H

Other labels: High Level

Numbered improvement items on the map:

1. Fully Automated Environment Provisioning
2. Visualization of flow of work and expected upcoming work
3. Standard product catalog ("Environments on Demand")
4. Shorten from Design to Implementation
5. New "white glove" engagement model
7. Small Batches
8. Write end-to-end customer func. tests
9. Service Verification test writing: shift left to Dev (test early)
10. Test data setup automation
11. Resolve interface to legacy
12. Remove Bottleneck and Environment Contention (test more)
13. Dev Deploy to Prod for legacy
14. Unify change management tools
15. Tool

• Make the work visible for all
• Manage flow and eliminate waste
• Build alignment and consensus across team boundaries
• Empower teams to find and fix what is getting in the way

• Small Batch
• Reduce Work in Process (WIP)
• 1x1 Flow
• Reduce Bottlenecks (TOC)
• Optimize Globally

Where does lean come from?

Let’s talk Kata

I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times

- Bruce Lee

Toyota is not a story about techniques. It’s an organization defined primarily by the unique behavior routines it continually teaches to all its members.

Mike Rother (Page 262-263)

I have no idea how to answer that question. It would literally never occur to me not to do it!

KATA

We are what we repeatedly do. Excellence, then, is not an act, but a habit.

Aristotle

Safety Culture

Views on Human Error

▪ The old view of human error (First Story)
  ▪ Human error is the cause of accidents
  ▪ To explain failure, you must seek failure
  ▪ You must find people’s inaccurate assessments, wrong decisions, bad judgments

▪ The new view of human error (Second Story)
  ▪ Human error is a symptom of trouble deeper inside a system
  ▪ To explain failure, do not try to find where people went wrong
  ▪ Instead, find how people’s assessments and actions made sense at the time, given the circumstances that surrounded them

▪ Bad Apple Theory - Throw away the bad apples
  ▪ Complex systems are basically safe; they need to be protected from unreliable people (bad apples)
  ▪ Human errors cause accidents: humans are the dominant contributor to more than two thirds of mishaps
  ▪ Errors occur because of human loss of situation awareness, complacency, negligence
  ▪ Errors are introduced to the system only through the inherent unreliability of people

Murphy’s Law is Wrong!

What can go wrong usually goes right, but then we draw the wrong conclusion.

- Sidney Dekker, The Field Guide to Human Error

Your organization must continually affirm that individuals are NEVER the ‘root cause’ of outages.

▪ Hindsight bias: the knew-it-all-along effect; seeing the event as having been predictable; counterfactuals

▪ Outcome bias: evaluating the quality of a decision when the outcome of that decision is already known

▪ Availability bias: preference by decision makers for information and events that are more recent

▪ Fundamental attribution error: explaining behavior in terms of internal disposition, such as personality traits, abilities, motives, etc., as opposed to external situational factors

▪ Just Culture at Etsy (John Allspaw)

▪ Encourage learning by having these blameless Post-Mortems on outages and accidents

▪ Understand how an accident happens, in order to better equip ourselves to prevent it from happening in the future

▪ Gather details from multiple perspectives on failures, and we don’t punish people for making mistakes

▪ Enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future

▪ Accept that there is always a discretionary space where humans can decide to take actions or not, and that the judgement of those decisions lies in hindsight

▪ Accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it

▪ Accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents

"In dynamic fault management, intervention precedes or is interwoven with diagnosis"

- Woods (1994)

Source: (Woods) John Allspaw - http://bit.ly/AllspawThesis

Learning Organization

That’s how it’s always been done around here!

You are either building a learning organization… or you will be losing to someone who is

- Andrew Clay Shafer

▪ Dr. Deming

A learning organization is a place where people are continually discovering how they create their reality.

- Peter Senge

Ladder of Inference - Chris Argyris

• Action
• Beliefs
• Conclusions
• Assumptions
• Meanings
• Select
• Observe

Ladder of Inference

▪ Can create bad judgement
▪ Our assumptions can lead us to bad conclusions
▪ Question your assumptions and conclusions
▪ Seek contrary data
▪ Make your assumptions visible to others
▪ Invite others to test your assumptions and conclusions
▪ Inquire into other people’s assumptions and conclusions
▪ Move down the ladder instead of up

Ladder of Inference - Bad Judgement

▪ Observe - Notice people in the first row
▪ Select - Person in the front row keeps looking at their phone
▪ Meaning - Not listening to my presentation
▪ Assumption - He is not interested
▪ Conclusion - Doesn’t like my new idea
▪ Beliefs - Their team always blocks new ideas
▪ Action - I send a nasty email to their boss

Ladder of Inference - Alternative Assumption

▪ Observe - I notice people in the first row
▪ Select - Person in the front row keeps looking at their phone
▪ Meaning - Not listening to my presentation
▪ Assumption - Try and engage with a question (safely)
▪ Conclusion - Might find out that they are late for another meeting and they really don’t want to miss this one… so they sent an email notifying the next meeting team that they will be late…
▪ Beliefs - They are very excited about this new idea
▪ Action - Both teams set up another meeting to engage.