Rails Operations - Lessons Learned

Post on 18-Dec-2014


Rails Operations: Lessons learned from deploying and managing hundreds of Rails applications

Thanks for coming out this morning. I know it’s hungover o’clock, so it means a lot. You are dedicated, upstanding individuals.

Oh hi, I’m Josh

• @techpickles

• http://github.com/technicalpickles

• http://technicalpickles.com

I am from the internet

Awesomeness Engineer of Supreme Versatility

II

My official title is Awesomeness Engineer of Supreme Versatility II. (I was recently promoted.)

Managed hosting and operations

We’re mostly known for our hosting. What isn’t as well known is our managed services. For this, we engage more closely with our customers.

When bringing on new managed customers, we work with them to spec out servers and review their application’s needs. We get them up and running on these servers with our configuration management tool, moonshine. And once deployed, we provide 24x7 monitoring. If your server goes down, we let you know, and get it back online as soon as possible, regardless of when it happens.

And that’s not all. Once live, we provide operational support. Anything from application performance analysis, recommending architecture improvements, installing and managing new software on servers, or just being there to give feedback on how the application is operating.

You can basically think of us as a Rails Operations company.

I’m talking about Rails Operations

Conveniently enough, I’m talking about Rails Operations today.

WTF is Rails Operations?

I found this hard to distill down to a simple statement.

I think it’s safe to say that the majority of us are developers. We write code, build applications, launch products.

In a lot of organizations, operations is something different. People associate operations with system administration. And to an extent, this can be fairly accurate. Different people, different teams, different. As developers, we write some code, toss it over the wall, and let _them_ handle it.

I think this is a bit flawed. The code you write has an operational impact. The systems you run it on have an operational impact on your code. It’s a complex relationship, and when developer and operations teams are separate, it’s hard to bridge the gap between them, since it’s neither’s responsibility.

Development and maintenance of a production Rails application

The simplest definition I’ve found is this.

Very important assumption: You develop code that will eventually go into production and, due in part to some business model, generate revenue

That is to say, you are part of some organization

Before we dig in too deep...

Let’s talk about the business. We need to start with where development and operations fit within the rest of The Business.

Question: Does development generate revenue?

• Takes place on laptops, desktop machines, staging servers

• No real users

• Unknown if it truly works

• Tests are green, but...

NO

but it CREATES potential revenue

• Step 1: Development

• Step 2: .......

• Step 3: PROFIT

Question: Does operations generate revenue?

• Lives on servers located in data centers and clouds

• Real users

• Either code works, or it doesn’t

• Either the application is available or not

NO

Just because your application works in production, doesn’t mean people are using it or buying your product.

but it PRESERVES potential revenue

If you have good operations, that means users will be able to see your application working and actually be able to use it.

• Step 1: Development

• Step 2: Operations

• Step 3: ......

• Step 4: PROFIT!

Question: Uh, what generates revenue?

Million Dollar Question

• Working features (or at least that work enough)

• Infrastructure to keep the application up and running (or at least up enough)

• A business model

• Sheer determination

• Good luck

Lessons learned

Alright. I’ve given you a definition of Rails Operations, and had a brief detour to talk about the business and where development and operations fit into it.

Now for some lessons. Basically, I’ll be going over some patterns, some antipatterns, and other practices and topics.

Common threads

Putting this all together, I kept coming back to some common threads. That is, some ideas that apply to many aspects. I’m going to start you off with a few together, and then just jump into the lessons. We’ll probably pick up a few more along the way.

Give a damn

If you don’t care about what you’re doing, everything else I’m talking about today probably doesn’t matter. I don’t think you need to worry about this though, since you are here.

Earlier we talked about how operations preserves revenue. To that end, our goal is to mitigate risk as much as makes sense.

Tradeoffs and compromise. Each possible solution has them. The trick is understanding that there are tradeoffs. What tradeoffs you make depends on what your priorities are. For example:

• Dollar signs

• Time

• Sanity

• Technical debt

• Higher risk

Configuration Management

Pattern

It’s about managing configuration.

duh.

You write code that manages your servers’ configuration

Take a moment to think about how you might describe a server to someone. There’s plenty of nouns:

• packages

• users

• files

• cronjobs

• services

And some verbs:

• running commands

• apache package is installed

• apache service is running

• deploy user exists

• cron jobs

• etc

• Moonshine

• Puppet

• Chef

Automation

Bootstrapping. Anyone who has set up a new server from scratch can tell you... it’s time consuming, labor intensive, and error prone.

Bootstrapping is just part of it, though, and it only ever happens once. What’s more interesting is that you can use this to manage your infrastructure as it evolves. Need to start using redis? Just add it to your configuration management, and you’ll have it next deploy.
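To make "just add it to your configuration management" a bit more concrete, here is a minimal sketch in plain Ruby. It is not the Moonshine, Puppet, or Chef API; the Resource struct, the plan method, and the resource names are all made up for illustration. The idea it shows: you declare the state you want, compare it to what the server has, and only the difference turns into work.

```ruby
# Hypothetical, simplified model of configuration management.
# This is NOT the Moonshine, Puppet, or Chef API.
Resource = Struct.new(:type, :name, :state)

# Compare desired state against current state and list what must change.
# Resources already in the desired state produce no action, so running
# this against a converged server yields an empty plan.
def plan(desired, current)
  desired.reject { |resource| current.include?(resource) }
         .map { |r| "ensure #{r.type} '#{r.name}' is #{r.state}" }
end

# What we want every application server to look like (names are assumptions).
DESIRED = [
  Resource.new(:package, "apache", :installed),
  Resource.new(:service, "apache", :running),
  Resource.new(:user, "deploy", :present),
  Resource.new(:package, "redis", :installed), # the newly added dependency
].freeze

# What a hypothetical server looks like before the next deploy.
CURRENT = [
  Resource.new(:package, "apache", :installed),
  Resource.new(:service, "apache", :running),
  Resource.new(:user, "deploy", :present),
].freeze

puts plan(DESIRED, CURRENT)
```

Adding redis is a one line change to the desired state, and the same declaration works whether you are bootstrapping a brand new server or updating one that has been running for a year.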

The best way to illustrate why you should be using configuration management is to explore the consequences of not using it.

Imagine it’s time to add a new application server. Your application is under heavy load, and needs this server to be up and serving requests. How long will it take you to get it up? And how will you know it’s set up correctly? If you’re doing this all manually, you can’t really know the answers to these questions.

Here’s another example. Adding a new dependency to your application. It can be a gem, a native package, a new daemon, whatever. How do you ensure this gets on the server when you need it? Deploy and pray? Log into the server and install it yourself? This sucks, and it’s kind of risky, especially if you’re talking about production.

As always, there’s tradeoffs to be made.

Setting up and learning how to do configuration management takes time. Time that could be spent working on user-facing tasks.

Skipping it means taking on the risk of having to cold deploy, or of having deploys fail because of missing dependencies.

Usually, the balance is to take the risk until it has burned you enough times that going without configuration management is more painful than stopping to set it up.

If you already know how to do it, it’s a no-brainer. Just DO IT.

Staging Servers

Pattern

Preproduction servers

Staging servers are all about being a testbed between

Helps ensure correctness of deploy

configuration management + staging servers = VERY YES

If you use configuration management, and have staging servers, then this is a huge win.

We talked about adding new dependencies earlier. If you are doing configuration management, then staging is the first place you can see if ur doing it right.

There’s basically no downside to using staging servers. The only tradeoff though is that servers do cost dollar signs and staging servers are no different. This leads us to a new thread...

Maths... look around you. In most cases, you can do some dollar sign math to justify costs of a thing. Let’s try this.

A staging server may cost $60/mo

But how can you calculate the cost of not having a staging server? Let’s assume that if you don’t have a staging server, you’re bound to do a bad deploy that it could have prevented. Some code that doesn’t work outright, or is otherwise flawed. Let’s say it causes an hour of downtime while you determine the problem and try to fix it. Do you know how much it costs your business in lost revenue to be down an hour?

This is actually a pretty mature question, and I’d be surprised if many people can answer it off hand. In any event, I think we can do some fuzzy math to say yeah, it probably is more than $60. If that’s the case, then one failed deploy a month is enough to validate a staging server.
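The fuzzy math above can be written down as a small Ruby sketch. Only the $60/month figure comes from the talk; the $200/hour revenue number is a made-up placeholder, and the method names are hypothetical. Swap in your own numbers.

```ruby
# $60/month is the figure from the talk. Every other number below is a
# made-up placeholder to be replaced with your own.
STAGING_COST_PER_MONTH = 60.0

# Expected monthly loss from bad deploys that staging would have caught.
def expected_downtime_cost(revenue_per_hour:, hours_down_per_bad_deploy:, bad_deploys_per_month:)
  revenue_per_hour * hours_down_per_bad_deploy * bad_deploys_per_month
end

# Hourly revenue at which the downtime costs exactly as much as staging.
def break_even_revenue_per_hour(hours_down_per_bad_deploy: 1.0, bad_deploys_per_month: 1.0)
  STAGING_COST_PER_MONTH / (hours_down_per_bad_deploy * bad_deploys_per_month)
end

loss = expected_downtime_cost(
  revenue_per_hour: 200.0,        # hypothetical
  hours_down_per_bad_deploy: 1.0, # "an hour of downtime"
  bad_deploys_per_month: 1.0      # "one failed deploy a month"
)

puts "Expected monthly loss without staging: $#{loss}"
puts "Staging pays for itself: #{loss > STAGING_COST_PER_MONTH}"
puts "Break-even hourly revenue: $#{break_even_revenue_per_hour}"
```

With one hour-long bad deploy a month, any application losing more than $60 for each hour of downtime comes out ahead by paying for staging.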

Repeat after me

• development

• staging

• production

capistrano-gitflow

Whenever possible, I like to enforce standards by means of automation.

For the flow of code from development -> staging -> production, we have capistrano-gitflow. Originally done up by apinstein, I did some refactorings and cleaned it up enough to be usable as a gem.

Effectively, this enforces development -> staging -> production. Whenever you deploy to staging, it tags the current branch including information about the date, the user deploying, and a small blurb about the changes. Assuming this is cool, you can promote a tag to production and go on from there. If you haven’t deployed to staging yet, you’ll be prompted and it will default to using the last production tag.
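As a rough illustration of the tagging idea, here is a sketch in plain Ruby that builds a tag name from the date, the deploying user, and a short blurb. The staging_tag method and the exact tag layout are assumptions for illustration; the format capistrano-gitflow really uses may differ, and the date and user below are placeholders.

```ruby
# Hypothetical sketch: build a staging tag name from the date, the
# deploying user, and a short blurb. Not capistrano-gitflow's own code.
def staging_tag(time:, user:, blurb:)
  # Git ref names cannot contain spaces, so squash the blurb into a slug
  # and trim any leading or trailing dashes.
  slug = blurb.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  ["staging", time.strftime("%Y-%m-%d-%H-%M-%S"), user, slug].join("-")
end

tag = staging_tag(
  time: Time.utc(2011, 8, 27, 9, 30, 0), # placeholder date
  user: "josh",
  blurb: "Fix login redirect"
)

puts tag
```

Because the timestamp comes first, tags sort chronologically, which makes "the last thing that went to staging" easy to find when it’s time to promote to production.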

Deploy early, deploy often

Pattern

A play on release early, release often.

Although technically, I guess it’s the same

It’s basically the same thing we hear in the open source community.

The sooner you release code, the sooner you can validate it and the sooner you can get feedback. Does it work? Does it not break the entire site? Are users happy?

By deploying early and often, we’re also limiting risk. The fewer changes that go out in a single deploy, the fewer things there are that can possibly break. By waiting to deploy, you’re accumulating a larger set of changes to deploy, and therefore there’s more surface area to debug if it breaks.

In a way, you can consider undeployed code a liability.

Imagine spending a day or two doing some code cleanups to get ready for a sprint. Should you deploy when you are done and happy with the refactorings, or should you go ahead and do your sprint?

If it were me, I’d deploy the refactorings first. That way, the code is out there, and you’ll know if it performs equally to its nonrefactored version. It’s really easy to introduce performance-killing changes in even a few-line diff.

If you instead wait and deploy with new features, if anything goes awry, you have significantly more code to spelunk to track down a potential problem.

Feeling Driven Development

Antipattern

Oh feelings.

The front page feels slow

The primary key seems like it’s increasing rapidly

IO seems high

What does it even mean?

This drives me nuts. By saying something ‘feels’ slow, there’s an implied assumption. The assumption is that it should be fast. Saying it like that is...weird, because it gives no indication of what is slow or not.

The trick is in determining what the assumption is, and then finding a way to measure and identify the problem.

How can we do this?

Science Driven Development

Counterpattern

Metrics everywhere!

With the right tools, you can easily be continuously collecting data so you have it in your pocket when you need it.

• New Relic - http://newrelic.com

• Scout - http://scoutapp.com

These are the two we use and highly recommend.

New Relic is really great for giving a high level view of your application. We’re talking at the request response level, including all sorts of fun maths with most time consuming requests, highest standard deviation, etc. It also breaks down requests by where time is spent. Like if it’s all in the view, the controller, the database, partials, etc etc.

Scout is useful for other reasons. While New Relic is good for high level understanding of your application, Scout is a bit more low level. You can use it to collect metrics about your servers, and how well they are running. Memory, CPU, disk space, IO, mysql connection stats, and so on.

I really believe these are a great combination, because New Relic can point you in the direction of a problem area, and Scout can help you better understand what’s contributing to it at a system level.

The front page feels slow

The front page is taking 10 seconds to load, but we really need it to be loading in under 1 second

The primary key seems like it’s increasing rapidly

The primary key is at 90% of its maximum, up from 80% yesterday, and looks like it’ll run out overnight.

IO seems high

IO fluctuates up to 90% sometimes, but doesn’t appear to have a negative effect
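Once you have numbers instead of feelings, you can do arithmetic on them. Here is a minimal Ruby sketch for the primary key example, using the 80% and 90% figures from the slide. The days_until_exhausted method is hypothetical, and it assumes growth stays linear, which real traffic may not respect.

```ruby
# Linear extrapolation: assumes usage keeps growing at the same rate it
# did between the two samples. Returns the number of days until 100%,
# or nil if usage is not growing.
def days_until_exhausted(previous_pct:, current_pct:, days_between: 1.0)
  growth_per_day = (current_pct - previous_pct) / days_between
  return nil unless growth_per_day.positive?

  (100.0 - current_pct) / growth_per_day
end

# 80% yesterday, 90% today: 10 points left, growing 10 points per day.
puts days_until_exhausted(previous_pct: 80.0, current_pct: 90.0)
```

Ten points of headroom at ten points a day is about a day left, which is a very different conversation from "it seems like it’s increasing rapidly".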

Monitoring

Topic

How do you know when everything is awful?

How would you prefer to know?

• Angry tweets

• Angry email from your boss

• You personally checking everything all the time

• An automated system to let you know

• Nagios

• Scout

What to monitor

It’s not a problem til it’s a problem

Define priority

Does it wake someone up?

Must be actionable

Single point of contact

If everything is awful, there needs to be a single point of contact. They take point, acknowledge, and begin looking into it. If need be, they bring on others.
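Those monitoring rules (define priority, does it wake someone up, must be actionable) can be sketched as a tiny routing function. This is plain Ruby with made-up names; it is not the Nagios or Scout API, and the alerts and priorities are examples only.

```ruby
# Hypothetical alert model, not the Nagios or Scout API.
Alert = Struct.new(:name, :priority, :action)

# Decide how an alert is routed. An alert with no action attached is not
# actionable, so it is only recorded and never notifies anyone.
def route(alert)
  return :record_only if alert.action.nil? || alert.action.empty?

  case alert.priority
  when :critical then :page_on_call # wakes the single point of contact
  when :warning  then :email        # can wait until morning
  else                :record_only
  end
end

alerts = [
  Alert.new("site is down", :critical, "fail over, then investigate"),
  Alert.new("disk 80% full", :warning, "clean up old releases"),
  Alert.new("IO at 90%, no user impact", :warning, nil),
]

alerts.each { |alert| puts "#{alert.name}: #{route(alert)}" }
```

The design choice worth copying is the first line of route: if nobody can do anything about an alert, it should not be allowed to interrupt anyone, whatever its priority says.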

Vertical scaling

Pattern

Your app is slow. Now what?

Resources are (relatively) cheap

Developers are (relatively) expensive

Imagine having memory issues.

As always there’s a balance.

Remember, it’s a tradeoff to optimize for developer time by vertically scaling. It buys you time to either deal

Hipster Stack

Antipattern

“I read a blog post about how mongo is totally web scale”

Cargo cult operations

Remember what’s important for the business? Do you want to become the expert at <insert technology here>? Is it really the most valuable thing you can be doing?

If you’re still going to go hipster...

• experiment in branches

• understand operational impact

• Staging!

Test in production

Wait, what?

Further Reading

• Web Operations - John Allspaw and Jesse Robbins

• Continuous Delivery - Jez Humble and David Farley

• “Web Operations for Developers 101”

http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/ref=sr_1_1?s=books&ie=UTF8&qid=1314447411&sr=1-1

http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912/ref=sr_1_4?s=books&ie=UTF8&qid=1314447411&sr=1-4

http://www.paperplanes.de/2011/7/25/web_operations_101_for_developers.html

Fin.

Want to talk ops? Find me here

josh@railsmachine, @techpickles

Do you like these things?

• Rails

• Operations

• Ping Pong

• Beer

We are hiring