Case Study How shifting to a DevOps Culture Enabled Performance and Capacity Improvements.

Transcript of Case Study How shifting to a DevOps Culture Enabled Performance and Capacity Improvements.

Page 1: Case Study How shifting to a DevOps Culture Enabled Performance and Capacity Improvements.

Case Study: How shifting to a DevOps Culture Enabled Performance and Capacity Improvements

Page 2

Greg Burton, Senior Software Engineer, Hotel Infrastructure and Platform

Ori Rawlings, Senior Software Engineer, Hotel Shopping and Optimization

Page 3

• Over $11.4B in total gross bookings in 2013
• Launched in 2001
• ~1,600 employees
• Chicago, IL

Page 4

What are we talking about today?

Capacity!

Page 5

What are we talking about today?

Capacity! 2x

Page 6

What are we talking about today?

Development Operations

Stuff the business needed to get done

Page 7

What are we talking about today?

Page 8

Global Platform

Page 9

Legacy technology

• 10+ years old
• 3+ version control systems
• 2+ RPC frameworks
• 3+ JDK major versions
• Developers cycle through, move on to other projects

Page 10

The Chasm of Responsibility

Development

• Designing, building, and testing end user features
• Patching bugs in apps
• Requesting production changes
• Responding to pages during on-call rotation

Operations

• Monitoring site health and performance
• Deploying changes to production
• Network configuration
• Database administration
• Racking and bootstrapping servers

Who?

Page 11

The Chasm of Responsibility

Development

Operations

Who?

• Performance tuning and optimization?
• Performance regression testing?
• Identifying capacity bottlenecks?
• Maintenance of legacy performance tuning?

Page 12

Sourcing several areas of expertise

Development

• Expertise on app internals
• Know which features are important/can be changed
• How services interact/collaborate to achieve functionality

Operations

• Knows roughly where pressure/pain points are in infrastructure
• Understanding of JVM tuning
• Sense of hardware capability

Who?

Page 13

Shared goal for Dev and Ops teams

Double the hotel search capacity of our entire stack

Page 14

Multiple limiting factors to reach search capacity goal

Capacity Goal

Limiting factors:

various hosts, apps, databases in search stack

Page 15

Limiting factor: database capacity (1)

Database load exceeds max recommended levels during peak traffic periods.

Qaiser, Database Architect

Frustration: not familiar with application code, but intuitively suspects that there are huge inefficiencies

Limited options: no budget to buy additional database capacity

Willing partner: excited to team up with developers to assess database load from multiple perspectives

Page 16

Limiting factor: database capacity (2)

Leveraging the Top 10 Query Report

Page 17

Limiting factor: database capacity (3)

[Diagram: Instance 0 through Instance N, each with an internal cache that a refresh process reloads from the database]

#1 and #3 top queries were offline processes to reload in-memory caches

#9 and #13 top queries were unnecessary and created by a bug in the code

Page 18

Results: database capacity (4)

40% reduction in CPU usage

50% reduction in connection requests

Changes roll out over this period

Total time spent: 4 weeks

Page 19

Limiting factor: Hotel Search Engine (1)

Legacy: for years, there was a reliance on horizontal scaling (adding more instances) to increase capacity.

Consequence: horizontal scale was no longer a viable option because each new instance adds load to the database in order to refresh in-memory caches.

[Diagram: Instance 0 through Instance N, each with an internal cache that a refresh process reloads from the database]
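The consequence above can be put in rough numbers. The following is a minimal sketch, assuming each instance reloads its cache independently with the same set of queries; the function name and all figures are hypothetical and are not taken from the case study.

```python
def refresh_queries_per_hour(instances: int,
                             queries_per_refresh: int,
                             refreshes_per_hour: float) -> float:
    """Database queries per hour spent only on reloading in-memory caches.

    Assumes every instance runs its own refresh process, so the cost
    grows linearly with the number of instances.
    """
    return instances * queries_per_refresh * refreshes_per_hour


# Hypothetical numbers, chosen only to show the shape of the growth.
for n in (50, 100, 200):
    load = refresh_queries_per_hour(n, queries_per_refresh=20, refreshes_per_hour=4)
    print(f"{n} instances -> {load} refresh queries/hour")
```

Under this assumption, doubling the instance count doubles the refresh load on the database even if user traffic stays flat, which is why adding instances stopped being a free way to add capacity.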

Page 20

Limiting factor: Hotel Search Engine (2)

Many ideas from many people…

Chosen direction: draw a line in the sand. Can we meet the capacity goal by tuning the host on existing hardware?

Opinion: host is badly tuned and uses hardware inefficiently

Opinion: developers are not aware of the costs of supporting so many instances.

Bobby, Director of Operations Center


Page 21

Limiting factor: Hotel Search Engine (3)

Why hadn’t something as fundamental as JVM tuning been done?

Dev perspective: Let’s focus on features. There are people in Operations who make sure the overall host operates well.

Ops perspective: Developers are responsible for using their hosts intelligently and keeping the JVM tuned when they add features.

Disconnect

Outcome: JVM tuning fell into an ownerless chasm between dev and ops.

Shift of perspective: a shared capacity goal encouraged a sense of host-level ownership.

Page 22

Limiting factor: Hotel Search Engine (4)

Investigation: We suspected that the JVM was not tuned well, but did not have a deep understanding of JVM dynamics.

Our first thought: We don’t know much about doing this. Are there others at the company who are the “right people for this job”?

Our next thought: Why not us? Why don’t we learn aggressively and become experts ourselves? Why don’t we apply a holistic, methodical approach to tuning?

Page 23

Limiting factor: Hotel Search Engine (5)

Hypothesis: New JVM tunings based on informed, methodical approach will increase capacity per instance.

Original tuning: undersized young generation of heap memory meant that garbage collections happened frequently and recovered little memory.

New tuning: larger young generation leads to fewer garbage collections and more memory recovered with each one (live objects have time to become garbage).

[Diagram: young generation heap space filling up with request objects, with the live data at collection time marked, shown for the original and the new tuning]
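The reasoning behind the new tuning can be illustrated with a toy model. This is a sketch under strong assumptions (constant allocation rate, every request object dying after a fixed lifetime); it is not how the team measured anything, and the sizes and rates are hypothetical. The case study does not say which JVM flags were changed; on HotSpot, `-Xmn` is one option that sets the young generation size.

```python
def minor_gc_model(young_gen_mb: float,
                   alloc_rate_mb_per_s: float,
                   request_lifetime_s: float) -> dict:
    """Toy model of minor collections in a generational heap.

    Assumes a constant allocation rate and that every request object
    becomes garbage exactly request_lifetime_s after it is allocated.
    """
    # Time it takes for allocations to fill the young generation.
    interval_s = young_gen_mb / alloc_rate_mb_per_s
    # Objects allocated within the last request_lifetime_s are still live
    # when the collection starts, so they cannot be reclaimed yet.
    live_fraction = min(1.0, request_lifetime_s / interval_s)
    return {
        "collections_per_minute": 60.0 / interval_s,
        "live_fraction_at_gc": live_fraction,
        "mb_recovered_per_gc": young_gen_mb * (1.0 - live_fraction),
    }


# Hypothetical workload: 512 MB/s allocated, objects live for 0.4 s.
small = minor_gc_model(young_gen_mb=256, alloc_rate_mb_per_s=512, request_lifetime_s=0.4)
large = minor_gc_model(young_gen_mb=2048, alloc_rate_mb_per_s=512, request_lifetime_s=0.4)
print("undersized:", small)
print("larger:    ", large)
```

In this model the larger young generation is collected less often and most of what it holds is already garbage when the collection runs, which matches the effect described on the slide.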

Page 24

Let’s just try it out in production!

Page 25

Limiting factor: Hotel Search Engine (7)

Performance testing: There was no established culture of performance testing. We had to invest the time to establish it.

Drive load reliably and repeatably: We developed JMeter test suites.

Source: Mature Optimization Handbook, Carlos Bueno, 2013

Establish an environment: Operations and Development built it carefully to guarantee parity with production. This is the only way that testing can produce valid conclusions.

Establish a trusted benchmark: Before introducing changes, we reproduced the capacity bottleneck that we were trying to eliminate in production.
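One way to picture a repeatable benchmark is a stepped load ramp that stops when latency crosses a limit. The team used JMeter for this; the sketch below is only an illustration in Python, and `find_capacity`, `simulated_p95_ms`, the latency limit, and the saturation point are all hypothetical.

```python
from typing import Callable, Optional


def find_capacity(measure_p95_ms: Callable[[int], float],
                  latency_limit_ms: float,
                  start_rps: int,
                  step_rps: int,
                  max_rps: int) -> Optional[int]:
    """Return the highest tested request rate whose p95 latency stays within the limit."""
    best = None
    for rps in range(start_rps, max_rps + 1, step_rps):
        if measure_p95_ms(rps) > latency_limit_ms:
            break  # the bottleneck has been reached; stop ramping
        best = rps
    return best


def simulated_p95_ms(rps: int) -> float:
    """Stand-in for a real load test run; latency climbs steeply near saturation."""
    saturation_rps = 1000  # hypothetical saturation point of the host
    if rps >= saturation_rps:
        return float("inf")
    return 50.0 / (1.0 - rps / saturation_rps)


baseline_capacity = find_capacity(simulated_p95_ms, latency_limit_ms=200.0,
                                  start_rps=100, step_rps=100, max_rps=1500)
print("baseline capacity (requests/s):", baseline_capacity)
```

Running the same ramp before and after a change gives two comparable numbers, which is the point of establishing the benchmark before introducing any tuning.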

Page 26

Limiting factor: Hotel Search Engine (8)

Test results: We confirmed our hypothesis. Tuning reduced the garbage collection overhead, which produced extra capacity.

Iterate quickly: the performance test environment allowed us to test many scenarios, zeroing in on the best one.

Trust: we used the same hardware as production, scaled down proportionally

Page 27

Limiting factor: Hotel Search Engine (9)

Tunings deployed in production: this host was now capable of reaching our capacity goal, and we reduced the instance count by 25%

Not so fast! This was only one host in a complex system. The next limiting factor awaits!

Total time spent: 2 months

Page 28

Limiting factor: Markup service (1)

Another opportunity for tuning, similar to what was done with the Hotel Search Engine

Result: we improved capacity to meet the target while reducing instance count from 110 to 16.

Payback for previous investments: we were able to leverage this foundation to deploy this tuning effort in 1 week.

• Performance testing environment
• Mature working relationship with Operations
• Growing knowledge base

Total time spent: 1 week

Page 29
Page 30

Limiting factor: Markup service (2)

But wait…it was not a storybook ending. Performance actually got worse after our deployment. Why?

Non-obvious problem: visible impacts at the application level had to be traced down to a cause at the operations level, a challenge requiring DevOps collaboration.

Production did not match performance test results: an environmental difference meant that we missed an important production dynamic.

Page 31

Limiting factor: Markup service (3)

What we did with Operations

Guerrilla-style meetings: Huddled around our desks and white boards whenever the work required it.

Eliminate red herrings: did not accept hunches as facts, and did not settle for tempting workarounds.

Shrink the haystack: debated hypotheses and planned experiments to test them. Falsified hypotheses are still valuable because they tell you what the problem is not.

What we avoided

Working separately, passing deliverables over “the wall”. This minimizes learning and is not conducive to reaching a shared understanding of the problem.

Inconclusive experiments. Commit to a hypothesis and run an experiment, rather than “chasing” hypotheses in the middle of an experiment.

Page 32

Limiting factor: Markup service (4)

The root cause: TCP connection tracking table in the host servers was filling up and dropping packets

After tuning: TCP connections were spread over fewer instances and fewer host servers.

[Diagram: host servers running VM 0 through VM 15, each with one instance, receiving the TCP connections for incoming requests]

Solution: Reduce the Conntrack table timeout for TCP connections in the TIME_WAIT state from several minutes to several seconds.
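Why the timeout matters can be estimated with Little's law: the number of entries held in the table is roughly the rate of new connections multiplied by how long each entry is retained. The sketch below is a back-of-the-envelope model that counts only TIME_WAIT entries; the table size, connection rate, and both timeout values are hypothetical. The case study does not name the setting it changed; on current Linux kernels this timeout is exposed as the sysctl `net.netfilter.nf_conntrack_tcp_timeout_time_wait`.

```python
def steady_state_entries(new_connections_per_s: float,
                         time_wait_timeout_s: float) -> float:
    """Little's law: entries held = arrival rate x time each entry is retained."""
    return new_connections_per_s * time_wait_timeout_s


TABLE_LIMIT = 65_536          # hypothetical connection tracking table size
NEW_CONNECTIONS_PER_S = 800   # hypothetical connection rate on one host server

# "Several minutes" versus "several seconds", using illustrative values.
before = steady_state_entries(NEW_CONNECTIONS_PER_S, time_wait_timeout_s=180)
after = steady_state_entries(NEW_CONNECTIONS_PER_S, time_wait_timeout_s=10)

print(f"before: {before:.0f} entries, table full: {before > TABLE_LIMIT}")
print(f"after:  {after:.0f} entries, table full: {after > TABLE_LIMIT}")
```

The same estimate also shows why concentrating connections on fewer host servers triggered the problem: the per-host connection rate went up while the retention time stayed the same.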

Page 33

Limiting factor: Markup service (5)

Why did we miss this in our performance testing?

Tomcat behavior: Reuses connections up to a certain count. Above that, it opens and closes a connection for each request, generating much higher connection volume.

The environmental difference: Production environment had enough clients to cross the Tomcat connection threshold. The performance testing environment did not.

[Diagram: in the production environment, hundreds of unique clients connect to Tomcat on the Markup host; in the performance testing environment, a scaled-down number of clients connect to Tomcat on the Markup host, producing fewer overall connections]

Total time spent: 4 weeks
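The environmental difference can be shown with a simplified model of the behavior described above. This is a sketch, not Tomcat's actual algorithm: it assumes one persistent connection per client below the threshold and a new connection for every request above it. The threshold and client counts are hypothetical.

```python
def new_connections_per_s(clients: int,
                          requests_per_client_per_s: float,
                          keepalive_threshold: int) -> float:
    """Connections opened per second under a simplified keep-alive model.

    At or below the threshold every client keeps one persistent connection,
    so steady-state connection churn is zero. Above it, the model assumes
    every request opens and closes its own connection.
    """
    if clients <= keepalive_threshold:
        return 0.0
    return clients * requests_per_client_per_s


THRESHOLD = 150  # hypothetical connection count at which reuse stops

# Both environments drive the same total load: 1000 requests per second.
perf_env = new_connections_per_s(clients=20, requests_per_client_per_s=50.0,
                                 keepalive_threshold=THRESHOLD)
production = new_connections_per_s(clients=400, requests_per_client_per_s=2.5,
                                   keepalive_threshold=THRESHOLD)

print("performance test env, new connections/s:", perf_env)
print("production, new connections/s:", production)
```

With identical request rates, the two environments produce very different connection volumes, so a load test that matches production only on requests per second can miss this dynamic.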

Page 34

Stability: Long-standing production issues (1)

Learning to live with something that’s wrong, rather than fixing it.

Why does it happen? Code problems that have operational consequences are not within the scope of Dev or Ops alone.

We end up with operational workarounds to development problems.

Page 35

Stability: Long-standing production issues (2)

Specific problem: our webapp degrades after 1-2 weeks of uptime, requiring restarts to avoid impairment.

Reproduced issue in performance testing environment on the first day. This allowed us to quickly iterate through experiments.

Limited our scope to this specific issue by determining its exact profile with a series of metrics.

Ops was a strong partner, driven by a growing confidence in our ability to solve famously difficult issues together.

Dividend: Performance testing environment

Dividend: Mature working relationship with Operations

Dividend: Growing knowledge base

Page 36

Stability: Long-standing production issues (3)

Valuable by-product: knowledge and approach began to spread rapidly as we started to see good results

No previous exposure to working with an Operations mindset. Learned voraciously from our recent experiences

Moved beyond feature-level perspective to consider the operational health of the entire host and hardware

Mental shift: operations-driven development is productive work

Ben, Developer

Guilty phase: operations issues take time away from “real work” i.e. feature work

Page 37

Stability: Long-standing production issues (4)

Isolated the problem with experiments: leveraged the performance testing environment to test hypotheses

Every test must be conclusive. Negative conclusions are also valuable because they shrink the size of the “haystack” in which we are searching.

Inconclusive tests yield no progress. “This is probably not the cause” is not a valuable outcome.

Document the results of every test. We conducted 22 different tests, and the results can easily be confused or forgotten.

Exp 00: Reproduction of issue in performance testing environment
Exp 01: Simplify JVM arguments to minimal set
…
Exp 12: Remove most flow execution listeners
…
Exp 21: Set CMSClassUnloadingEnabled to collect PermGen
Exp 22: Reuse Xstream instance rather than one per request
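The rules above (every test conclusive, every result documented) could be enforced by the structure of the log itself. The following is a minimal sketch of one possible record format; the class names, fields, and the sample entries are hypothetical and do not reproduce the team's actual experiment log or results.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Deliberately no "inconclusive" value: a test must reach a conclusion.
    CONFIRMED = "confirmed"
    FALSIFIED = "falsified"


@dataclass(frozen=True)
class Experiment:
    number: int
    change: str
    hypothesis: str
    outcome: Outcome
    evidence: str

    def __post_init__(self):
        if not self.evidence.strip():
            raise ValueError("record the evidence that supports the conclusion")


def ruled_out(log):
    """Changes whose hypotheses were falsified, i.e. the haystack removed so far."""
    return [e.change for e in log if e.outcome is Outcome.FALSIFIED]


# Hypothetical entries, for illustration only.
log = [
    Experiment(1, "hypothetical change A", "A causes the degradation",
               Outcome.FALSIFIED, "degradation reproduced with A removed"),
    Experiment(2, "hypothetical change B", "B causes the degradation",
               Outcome.CONFIRMED, "degradation absent with B removed"),
]
print(ruled_out(log))
```

Leaving "inconclusive" out of the allowed outcomes is the design choice here: a run that cannot be classified has to be repeated or redesigned before it enters the log.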

Page 38

Stability: Long-standing production issues (5)

The root cause: by misusing a library, we introduced a custom classloader leak that gradually drove up minor garbage collection times.

One line of code was the problem: it created a new object for every request, rather than reusing one.

Difficult for Dev to know or care about because the impact was limited to Ops.

Difficult for Ops to track down because the answer was in the application code. Settled for workarounds.

Consequence: the misuse remained in the codebase for over a year before it was discovered!

Total time spent: 2 months
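The shape of the one-line problem, a new object per request instead of a shared one, can be sketched as follows. The original code was Java and involved the Xstream library; this Python analog uses a hypothetical `Serializer` class as a stand-in and only counts constructions. It does not reproduce the classloader leak itself, and it assumes the shared object is safe to use concurrently.

```python
class Serializer:
    """Hypothetical stand-in for an object that is expensive to construct."""
    instances_created = 0

    def __init__(self):
        Serializer.instances_created += 1

    def to_text(self, payload: dict) -> str:
        return repr(sorted(payload.items()))


def handle_request_per_request(payload: dict) -> str:
    # Problem pattern: a fresh object is built for every request.
    return Serializer().to_text(payload)


_SHARED = Serializer()  # built once, at startup


def handle_request_shared(payload: dict) -> str:
    # Fixed pattern: every request reuses the same object.
    return _SHARED.to_text(payload)


created_before = Serializer.instances_created
for i in range(1000):
    handle_request_per_request({"id": i})
per_request_count = Serializer.instances_created - created_before

created_before = Serializer.instances_created
for i in range(1000):
    handle_request_shared({"id": i})
shared_count = Serializer.instances_created - created_before

print("objects created, per-request pattern:", per_request_count)
print("objects created, shared pattern:", shared_count)
```

Both handlers return the same result for the same input, which is why a bug of this kind has no functional impact and is visible only in operational metrics.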

Page 39

Key Take-aways

• Establish shared goals to break down barriers between Dev and Ops, which leads to shared understanding of problems

• Commit to skepticism and use hypothesis testing as a contract to evaluate everyone’s ideas

• When justifying time spent vs. value produced, account for the investments that produce reusable value

• Recognize that bugs with no functional impact can have huge operational impact

Page 40

Questions?

We’re hiring!

@OrbitzTalent

http://careers.orbitz.com