Leading a Successful DevOps Transition. Lessons from the Trenches
Randy Shoup
Consulting CTO
What Is DevOps?
• Continuous Delivery?
– Rapid cycle times
– Automated testing and Continuous Integration
– Deployment automation and version control
• Lean Management Practices?
– Limiting work-in-progress via small batch sizes
– Rapid feedback via visual displays and monitoring
• Collaborative approach to Development and Operations
– Act as one team across different disciplines
– Solve problems instead of pointing fingers
• Organizational and cultural factors are most important
Taking The DevOps Journey
• Traditional Enterprises Adopting DevOps
– Financial Services: Capital One, ING, Bank of America, Nationwide
– Manufacturers: General Electric, General Motors, Raytheon, Intel, Cisco, HP
– Retailers: Target, Nordstrom, Macy’s
• Higher Throughput and Stability
– High-performing IT organizations have 60x fewer failures and recover 168x faster
– High-performing IT organizations deploy 30x more frequently with 200x shorter lead times
• Improved Business Results
– Public companies with high-performing IT organizations had 50% higher growth in market capitalization over 3 years vs. low-performing IT organizations
Using Conway’s Law
• Organization determines architecture
– Design of a system will be a reflection of the communication paths within the organization
• Agile, modular system requires an agile, modular organization
– Small, independent teams lead to more flexible, composable systems
– Larger, interdependent teams lead to more monolithic systems
• We can engineer the system we want by engineering the organization (!)
Small “Service” Teams
• Team develops a single set of applications or services
– Clear, well-defined area of responsibility
– Minimal, well-defined “interface”
• Amazon “Two Pizza Team”
– No team should be larger than can be fed by 2 large pizzas
– Typically 3-5 people
– Mix of junior and senior people
Small “Service” Teams
• End-to-End Ownership
– Cross-functional team owns application / service from design to deployment to retirement
– Able to move very rapidly and independently
• Self-Sufficiency
– Team has all the skill sets it needs to do the job
– Depends on other teams for supporting services
• “You Build It, You Run It”
– The same team that builds the software operates the software
– No separate maintenance or sustaining engineering team
Lose the Ticket Culture
Ticket Culture              | Ownership Culture
Do what is asked for        | Do what is needed
One-way communication       | Two-way collaboration
Goal is to close the ticket | Goal is product success
Reactive approach           | Proactive approach
Reinforces silos            | Reinforces collaboration
Prioritizes process         | Prioritizes results
Enforce a Service Mentality
• Vendor-Customer Discipline
– Service team is a vendor; the applications are its customers
– Service is useful only to the extent it provides value to its customers
• Customer can choose to use service or not (!)
– Customer team is responsible for deciding what is best for their use case
– Use the right tool for the right job
• Provides powerful incentives
– Service must be *strictly better* than the alternatives of build, buy, borrow
Charge for Usage
• Charge customers for *usage* of the service
– Aligns economic incentives of customer and provider
– Motivates both sides to optimize efficiency
• Free usage leads to waste
– No incentive to control usage or find more efficient alternatives
• E.g., App Engine usage at Google
– Charging a particularly egregious internal customer led to a 10x reduction in usage
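The usage-based charging described above can be sketched in a few lines. This is only an illustration: the rate, the team names, and the `monthly_charges` helper are all hypothetical, and the slides do not say how any company actually computes internal charges.

```python
from collections import defaultdict

# Hypothetical price per CPU-hour; the slides give no actual rates.
RATE_PER_CPU_HOUR = 0.05


def monthly_charges(usage_records, rate=RATE_PER_CPU_HOUR):
    """Sum CPU-hours per customer team and convert the total to a charge."""
    totals = defaultdict(float)
    for team, cpu_hours in usage_records:
        totals[team] += cpu_hours
    return {team: round(hours * rate, 2) for team, hours in totals.items()}


# Made-up usage records: (customer team, CPU-hours consumed).
records = [("search", 1200.0), ("checkout", 300.0), ("search", 800.0)]
charges = monthly_charges(records)

# List the heaviest users first so outliers are easy to spot.
for team, amount in sorted(charges.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${amount:.2f}")
```

Even a report this simple makes usage visible to both sides, which is where the incentive to optimize comes from.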
Shared On-Call Duties
• All members of the team rotate on-call responsibilities
– Strong motivator to build in solid monitoring and diagnosis tools
– Best way to learn the real-world behavior of the system
– Best way to develop empathy for customers and other team members
• Common resistance
– Unfamiliarity with production systems and tools
– Fear of making a mistake
– “That’s not my job”
Shared On-Call Duties
• On-call “apprenticeship”
– Apprentice starts as secondary on-call with an experienced primary, observes and learns from the primary in action
– Apprentice next takes primary on-call with an experienced secondary
– Apprentice graduates
• Ops at Google
– Developers are on-call for first 6+ months of a new service
– Service can “graduate” to Ops coverage only after intensive review of its monitoring, reliability, resilience, etc.
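The apprenticeship steps above amount to a small state machine. The sketch below is one possible way to model it; the `Stage` names and the `pair_shift` helper are assumptions made for illustration, not a description of any real scheduling tool.

```python
from enum import Enum


class Stage(Enum):
    SECONDARY = 1  # apprentice shadows an experienced primary
    PRIMARY = 2    # apprentice leads, backed by an experienced secondary
    GRADUATED = 3  # apprentice joins the regular rotation


def next_stage(stage):
    """Advance an apprentice one step; graduates stay graduated."""
    if stage is Stage.GRADUATED:
        return stage
    return Stage(stage.value + 1)


def pair_shift(apprentice, stage, mentor):
    """Return (primary, secondary) for one shift, given the apprentice's stage."""
    if stage is Stage.SECONDARY:
        return (mentor, apprentice)
    if stage is Stage.PRIMARY:
        return (apprentice, mentor)
    raise ValueError("graduates are scheduled in the normal rotation")
```

For example, `pair_shift("ana", Stage.SECONDARY, "raj")` puts the mentor `"raj"` on primary and the apprentice `"ana"` on secondary.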
Turn Approvals Into Code
• Reduce or eliminate approval bodies
– E.g., eBay Architecture Review Board
– (-) Too late
– (-) Too slow
– (-) Too disengaged from details
• Package expertise in code
– Smart, experienced people build their knowledge into code
– Teams with specialized skills (databases, security, compliance, etc.) provide services, libraries, or tools
Turn Approvals Into Code
• E.g., Security at Google
– Provide secure foundations by maintaining lower-level libraries and services
– Provide self-service penetration tests, vulnerability assessments, etc.
• The best way to “enforce” a standard practice is with working code
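As a rough illustration of packaging expertise in code, a review checklist can become a function that runs automatically on every change. The manifest fields (`owner_team`, `tls`, `env`) and the rules below are invented for this sketch; a real team would encode its own standards.

```python
def check_manifest(manifest):
    """Return a list of violations; an empty list means the change may proceed."""
    problems = []
    # Rule 1 (hypothetical): every service must name an owning team.
    if not manifest.get("owner_team"):
        problems.append("owner_team is required")
    # Rule 2 (hypothetical): transport encryption must be switched on.
    if not manifest.get("tls", False):
        problems.append("tls must be enabled")
    # Rule 3 (hypothetical): secrets must not be passed as plain env vars.
    for key in manifest.get("env", {}):
        if "PASSWORD" in key.upper():
            problems.append(f"env var {key} looks like a secret")
    return problems


good = {"owner_team": "search", "tls": True, "env": {"LOG_LEVEL": "info"}}
bad = {"tls": False, "env": {"db_password": "hunter2"}}

print(check_manifest(good))
print(check_manifest(bad))
```

A check like this gives feedback in seconds, at the moment the change is made, instead of weeks later in a review meeting.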
Migrate to Microservices
• Single-purpose
• Simple, well-defined interface
• Independently testable
• Independently deployable
• Easy to understand and reason about
• Smaller surface area
Embrace the Cloud
• Rapid Provisioning and Deployment
– Minutes, not weeks
• API-driven infrastructure
– Automatable and repeatable
– Constrained threat surface
• Pay For What You Use
– No “utilization risk” from owning / renting
– If it’s not in use, spin it down
• Build on Provider’s Scaling and Security Expertise
– Few organizations have the security resources of Amazon or Google
Embrace the Cloud
• The 2010s of computing are like the 1910s of electric power
• Soon it will be just as common to run your own computing infrastructure as it is to operate your own electric power generation
Build a Quality Culture
• Quality, Performance, and Reliability are “Priority-0 features”
– “Stop the line” if there is a degradation
– Equally important to users as product features or engaging user experience
• Developers responsible for
– Features
– Quality
– Performance
– Reliability
– Manageability
Build a Quality Culture
• Developers write tests and code together
– Continuous testing of features, performance, load
• Tests make better code
– Tests “have your back”
– Confidence to break things
– Confidence to refactor
• Tests help you move faster
– Catch bugs earlier, fail faster
– “Slow down to speed up”
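A minimal example of tests written alongside the code they cover. The `retry_delay` function is a made-up stand-in for any small piece of production logic; the tests use plain `assert` statements in the style a runner such as pytest would collect.

```python
def retry_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff in seconds: base * 2**attempt, never above cap."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(cap, base * (2 ** attempt))


def test_delay_doubles():
    assert retry_delay(0) == 0.5
    assert retry_delay(3) == 4.0


def test_delay_is_capped():
    assert retry_delay(20) == 30.0


def test_negative_attempt_rejected():
    try:
        retry_delay(-1)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

With these in place, the body of `retry_delay` can be rewritten freely: if the tests still pass, the behavior callers depend on is intact.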
Build a Quality Culture
• E.g., Development Process at Google
– Code reviews before submission
– Automated tests for everything
– Single searchable source code repository
• Internal Open Source Model
– Not “here is a bug report”
– Instead “here is the bug; here is the code fix; here is the test that verifies the fix”
Actively Manage Technical Debt
• Maintain sustainable and well-understood level of debt
– Measured by engineering effort to fix
– Plan for how and when you will pay it off
– Track feature work vs. accrued debt over time
• “Don’t have time to do it right”?
– WRONG -- Don’t have time to do it twice (!)
– The more constrained you are on time and resources, the more important it is to do a solid job the first time
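One simple way to track debt over time, assuming debt is estimated in engineer-days of effort to fix. The sprint names, field names, and numbers are invented for the sketch.

```python
def debt_trend(sprints):
    """Return (sprint name, outstanding debt) after each sprint's accrual and payoff."""
    balance = 0.0
    trend = []
    for sprint in sprints:
        # Debt cannot go below zero, even if more was paid off than recorded.
        balance = max(0.0, balance + sprint["debt_accrued"] - sprint["debt_paid"])
        trend.append((sprint["name"], balance))
    return trend


# Made-up history: engineer-days of debt added and paid off per sprint.
history = [
    {"name": "S1", "debt_accrued": 5, "debt_paid": 1},
    {"name": "S2", "debt_accrued": 2, "debt_paid": 4},
    {"name": "S3", "debt_accrued": 6, "debt_paid": 0},
]

for name, balance in debt_trend(history):
    print(f"{name}: {balance:.1f} engineer-days outstanding")
```

A balance that rises sprint after sprint is the signal to schedule payoff work before it starts slowing feature delivery.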
Vicious Cycle of Technical Debt
• “No time to do it right” → Quick-and-dirty → Technical Debt → (repeat)
Virtuous Cycle of Investment
• Invest in Quality → Solid Foundation → Confidence → Faster and Better → (repeat)
Blameless Post-Mortems
• Post-mortem After Every Incident
– Document exactly what happened
– What went right
– What went wrong
• Open and Honest Discussion
– What contributed to the incident?
– What could we have done better?
Blameless Post-Mortems
• Take fear and personalization out of it
– Engineers will compete to take personal responsibility (!)
– “Finally we can fix that broken system”
• Focus on Learning and Improvement
– How should we change process, technology, documentation, etc.?
– How could we have automated the problems away?
– How could we have diagnosed more quickly?
– How could we have restored service more rapidly?
DevOps in Action
• eBay Search Ranking Improvements
– Which item should appear 1st, 10th, 100th, 1000th
– Before: Small number of hand-tuned factors
– Goal: Thousands of machine-learned factors
• Rapid experimentation and feedback
– Deployed hundreds of parallel A/B tests every day
– Full year of steady, incremental improvements
• $120M in incremental eBay revenue
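Running many experiments in parallel requires a stable way to place each user in a variant. The slides do not describe how eBay did this; the sketch below shows one common approach, hash-based bucketing, with a hypothetical `assign_variant` helper and made-up experiment names.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant for one experiment."""
    # Including the experiment name in the hash input keeps a user's
    # assignments in different experiments independent of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same user always lands in the same variant of a given experiment.
first = assign_variant("user-42", "ranking-factor-17")
second = assign_variant("user-42", "ranking-factor-17")
print(first, second)
```

Because the assignment is a pure function of the user and experiment identifiers, no assignment table has to be stored or synchronized across servers.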
Not Just for Unicorns
• DevOps practices have become mainstream
• High performance is achievable by any IT organization
• Organizational and cultural change requires a significant investment of time and effort …
• … but the benefits are well worth it