Inside GitHub with Chris Wanstrath
hi
Hello. Hi everyone.
My name is Chris Wanstrath. I go by @defunkt online.
inside github
And today I’m going to talk about GitHub.
inside github
That’s me.
GitHub is what we like to call “social coding.”
You can see what your friends are doing from your dashboard or news feed
Everyone has a profile showing off their code and activity
And you can do things like leave comments on commits.
But it wasn’t always like this.
Originally we just wanted to make a git hosting site.
In fact, that was the first tagline.
git repository hosting
git repository hosting.
That’s what we wanted to do: give us and our friends a place to share git repositories.
a brief history
let’s start with a brief history
It’s not easy to set up a git repository. It never was.
But back in 2007 I really wanted to.
I had seen Torvalds’ talk on YouTube about git.
But it wasn’t really about git - it was more about distributed version control.
It answered many of my questions and clarified DVCS ideas.
I still wasn’t sold on the whole idea, and I had no idea what it was good for.
CVS is stupid
But when Torvalds says “CVS is stupid”
and so are you
“and so are you,” the natural reaction for me is...
To start learning git.
At the time the biggest and best free hosting site was repo.or.cz.
Right after I had seen the Torvalds video, the god project was posted up on repo.or.cz
I was interested in the project so I finally got a chance to try it out with some other people.
Namely this guy, Tom Preston-Werner.
Seen here in his famous “I put ketchup on my ketchup” shirt.
I managed to make a few contributions to god before realizing that repo.or.cz was not different.
git was not different.
Just more of the same - centralized, inflexible code hosting.
This is what I always imagined.
No rules. Project belongs to you, not the site. Share, fork, change - do what you want.
Give people tools and get out of their way. Less ceremony.
So, we set off to create our own site.
A git hub - learning, code hosting, etc.
We started with the code browsing and commit viewing...
But once we added the current version of the dashboard, we knew this was different.
And eventually “git repository hosting” gave way to “social coding”
Join 500,000 coders with over 1,500,000 repositories
Unleash Your Code
What’s special about GitHub is that people use the site in spite of git.
Many git haters use the site because of what it is - more than a place to host git repositories, but a place to share code with others.
2007 october
The first commit was on a Friday night in October, around 10pm.
2008 january
We launched the beta in January at Steff’s on 2nd street in San Francisco’s SOMA district.
The first non-github user was wycats, and the first non-github project was merb-core.
They wanted to use the site for their refactoring and 0.9 branch.
2008 april
A few short months after that we launched to the public.
Along the way we managed to pick up Scott Chacon, our VP of R&D
Tekkub, our level 80 support druid
Melissa Severini, who keeps us all in check
Kyle Neath, who makes the site pretty
Ryan Tomayko, who helps keep the site running smoothly.
Zach Holman, head of enterprise
Rick Olson, Rails extraordinaire
Eston Bond, Design Generalissimo
Corey Donohoe, Director of Shipology
And Brian Lopez, our bleeding edge cowboy
Oh yeah, and the other founders: PJ and Tom.
github.com
That’s where we’re at today.
So let’s talk about the technical details of the website: github.com
.com as opposed to FI, which I’m not going to get into today.
You’ll have to invite PJ out if you want to hear about that.
We also have a store
A job board
And do git training
the web site
As everyone knows, a web “site” is really a bunch of different components.
Some of them generate and deliver HTML to you, but most of them don’t.
Our site consists of four major code “frameworks” or “apps”
rails
#1 GitHub.com, Gist, etc.
resque
#2 Background processing, 50ish different job types currently
smoke
#3 All git calls happen over the wire
utils
#4 Exception logging, stats, helper apps, etc.
rails
We use Ruby on Rails 2.2.2 as our web framework.
It’s kept up to date with all the security patches and includes custom patches we’ve added ourselves, as well as patches we’ve cherry-picked from more recent versions of Rails.
rails
GitHub is about 20,000 lines of Rails code, not counting Rails itself, plugins, or gems.
We found out Rails was moving to GitHub in March 2008, after we had reached out to them and they had turned us down.
So it was a bit of a surprise.
rails plugins
We currently have 27 Rails plugins installed, and that number is always changing.
shopify / active_merchant
lgn21st / s3_swf_upload
technoweenie / serialized_attributes
query_trace
query_analyzer
rubygems
GitHub depends on about 50 RubyGems
albino
ar-extensions
aws-s3
faker
faraday
github-markup
rdiscount
jekyll
gollum
redis-rb
rack
One of the big features in Rails 2.3 is Rack support.
We badly wanted this, but didn’t want to invest the time upgrading.
So using a few open source libraries we’ve wrapped our Rails 2.2.2 instance in Rack.
Now we can use awesome Rack middleware like Rack::Bug in GitHub
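Roughly, the wrapping looks like this config.ru sketch (the talk doesn’t name the libraries; Thin’s Rack::Adapter::Rails is one that does the job):

```ruby
# config.ru -- a sketch, not our actual file
require 'thin'

use Rack::CommonLogger   # any Rack middleware can now sit in front
# use Rack::Bug          # e.g. the Rack::Bug debugging toolbar

run Rack::Adapter::Rails.new(:root => File.dirname(__FILE__))
```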
Coders created and submitted dozens of Rack middleware for the Coderack competition last year.
I was a judge, so I’d already seen the submissions. Some of my favorites were:
nerdEd / rack-validate
webficient / rack-tidy
talison / rack-mobile-detect
sets the X_MOBILE_DEVICE header to the mobile device, if recognized
unicorn
We use unicorn as our application server
- master / worker
- 16 workers
- preforking
unicorn
- instant restart after kill
- hard 30s request timeouts
- control ram growth
unicorn
- 0 downtime deploys
- protects against bad rails startup
- migrations handled old fashioned way
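Put together, the config reads like this sketch (the worker count and timeout are the numbers above; the before_fork hook is the stock zero-downtime pattern from unicorn’s example configs):

```ruby
# unicorn.rb -- a sketch, not our exact config
worker_processes 16   # preforking master/worker, 16 workers
timeout 30            # hard 30s request timeouts
preload_app true      # master boots the app once; workers fork instantly

before_fork do |server, worker|
  # zero-downtime deploys: when the new master is up, tell the old
  # master (its pid file has been renamed to .oldbin) to quit
  old_pid = "#{server.config[:pid]}.oldbin"
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill(:QUIT, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # old master already gone
    end
  end
end
```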
nginx
For serving static content and slow clients, we use nginx
nginx is pretty much the greatest http server ever
it’s simple, fast, and has a great module system
nginx
Limit Zone
Limit simultaneous connections from a client
nginx
Limit Requests
Limit frequency of connections from a client
Anti-DDOS
nginx
I see many people using Rack to do what the Limit modules do.
Don’t.
nginx
memcached
memcached support
can serve directly from memcached
nginx
Push Module
comet!
git
The next major part of GitHub is git
grit
We wrote an open source library called Grit which lets us use git from Ruby
mojombo / grit
you can get it here
it originally shelled out to git and just parsed the responses.
which worked well for a long time.
grit
File.read()
Eventually we realized, however, that File.read() can be 100 times faster
grit
system()
Than shelling out
One of the first things Scott worked on was rewriting the core parts of Grit to be pure Ruby
Basically a Ruby implementation of Git
mojombo / grit
And that’s what we run now
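For a feel of the API (the repo path here is made up):

```ruby
require 'grit'

repo = Grit::Repo.new("/data/repos/mojombo/grit.git")

commit = repo.commits.first      # most recent commit on master
commit.id                        # the commit sha
commit.author.name               # author name, via a Grit::Actor
(repo.tree / "README.md").data   # walk the tree, read a blob
```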
smoke
Kinda.
Eventually we needed to move our git repositories off of our web servers
Today our HTTP servers are distinct from our git servers. The two communicate using smoke
smoke
“Grit in the cloud”
Instead of reading and writing from the disk, Grit makes Smoke calls
The reading and writing then happens on our file servers
bert-rpc
Rather than use Protocol Buffers or Thrift or JSON-RPC, Smoke uses BERT-RPC
bert-rpc
BERT : Erlang :: JSON : JavaScript
BERT is an erlang-based protocol
BERT-RPC is really great at dealing with large binaries, which is a lot of what we do
bert-rpc
we have four file servers, each running bert-rpc servers
our front ends and job queue make RPC calls to the backend servers
mojombo / bertrpc
You can grab bert-rpc on GitHub
mojombo / bert
Or if you just want to play with BERT
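The client side is tiny. A sketch (the host, port, and remote module are made up; the call style is from the bertrpc README):

```ruby
require 'bertrpc'

svc = BERTRPC::Service.new('fs1.example.com', 8000)

# a synchronous remote call: mod.fun(args) runs on the BERT-RPC server
svc.call.store.get(:foo)
```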
chimney
We have a proprietary library called chimney
It routes the smoke. I know, don’t blame me.
chimney
All user routes are kept in Redis
Chimney is how our BERT-RPC clients know which server to hit
It falls back to a local cache and auto-detection if Redis is down
chimney
It can also be told a backend is down.
It was optimized for “connection refused,” but in reality that wasn’t the real problem - timeouts were
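Chimney is proprietary, so here is only a sketch of the idea rather than its code:

```ruby
require 'redis'
require 'timeout'

# user -> file server routes live in Redis; fall back to a local
# cache when Redis is unreachable
class Routes
  def initialize
    @redis = Redis.new
    @local = {}
  end

  def host_for(user)
    host = @redis.get("chimney:#{user}")
    @local[user] = host if host
    host || @local[user]
  rescue Errno::ECONNREFUSED, Timeout::Error
    @local[user]   # Redis is down; use what we saw last
  end
end
```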
proxymachine
All anonymous git clones hit the front end machines
the git-daemon connects to proxymachine, which uses chimney to proxy your connection between the front end machine and the back end machine (which holds the actual git repository)
very fast, transparent to you
mojombo / proxymachine
proxymachine can be used to proxy any kind of tcp connection
open source
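A routing block looks roughly like this sketch, in the spirit of the git-daemon example in proxymachine’s README (Chimney.lookup is hypothetical):

```ruby
proxy do |data|
  if data =~ %r{git-upload-pack /([^/]+)/}
    # route the clone to the backend that holds this user's repos
    { :remote => Chimney.lookup($1) }   # a "host:port" string
  elsif data.length > 128
    { :close => true }                  # not a request we recognize
  else
    { :noop => true }                   # wait for more data
  end
end
```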
ssh
Sometimes you need to access a repository over ssh
In those instances, you ssh to a front end machine and we tunnel your connection to the appropriate backend
To figure that out we use chimney
node.js
- downloads
- http => https <img>
- event streams
hubot
jobs
We do a lot of work in the background at GitHub
resque
Currently we use a system called Resque.
defunkt / resque
You can grab it on GitHub
resque
- dealing with pushes
- web hooks
- creating events in the database
- generating GitHub Pages
- clearing & warming caches
- search indexing
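A job is just a class with a queue name and a .perform class method. This one is made up, but it mirrors the web hooks item above:

```ruby
require 'resque'
require 'net/http'
require 'uri'

class DeliverWebHook
  @queue = :high   # which queue the job lands on

  def self.perform(url, payload)
    Net::HTTP.post_form(URI.parse(url), 'payload' => payload)
  end
end

# enqueued from the app after a push; a background worker picks it up
Resque.enqueue(DeliverWebHook, "http://example.com/hook", "{}")
```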
queues
In Resque, a queue is used as both a priority and a localization technique
By localization I mean, “where your workers live”
queues
critical, high, low
these three run on our front end servers
Resque processes them in this order
queues
page
GitHub Pages are generated on their own machine using the `page` queue
queues
archive
And tarball and zip downloads are created on the fly using the `archive` queue on our archiving machines
search
On GitHub, you can search code, repositories, and people
solr
Solr is basically an HTTP interface on top of Lucene. This makes it pretty simple to use in your code.
We use solr because of its ability to incrementally add documents toan index.
Here I am searching for my name in source code
solr
We’ve had some problems making it stable but luckily the guys at Pivotal have given us some tips
Like bumping the Java heap size.
Whatever that means
database
Our database story is pretty uninteresting
mysql
We use mysql 5
master / slave
All reads and writes go to the master
We use the slave for backups and failover
caching
On the site we do a ton of caching using memcached
fragments
We cache chunks of HTML all over
Usually they are invalidated by some action
fragments
Formerly we invalidated most of our fragments using a generation scheme, where you put a number into a bunch of related keys and increment it when you want all those caches to be missed (thus creating new cache entries with fresh data)
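To make that concrete, a sketch of the generational pattern (not our actual code):

```ruby
# every related fragment key embeds a generation number
def generation(repo)
  Rails.cache.fetch("gen:repo:#{repo.id}") { 1 }
end

# bumping the number misses every fragment built with the old one
def expire_repo_fragments!(repo)
  Rails.cache.increment("gen:repo:#{repo.id}")
end

# in a view:
#   <% cache "repo:#{repo.id}:g#{generation(repo)}:sidebar" do %> ... <% end %>
```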
fragments
But we had high cache eviction due to low ram and hardware constraints, and found that scheme did more harm than good.
We also noticed that some cached data we wanted to keep forever was being evicted, because the slabs holding generational keys filled up fast.
page
We cache entire pages using nginx’s memcached module
Lots of HTML, but also other data which gets hit a lot and changes rarely:
page
- network graph json
- participation graph data
Always looking to stick more into page caches
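The write side might look like this sketch (the filter and key scheme are assumptions, not our actual code; nginx would be configured to look up the same key):

```ruby
class PagesController < ApplicationController
  after_filter :store_in_memcached, :only => :show

  private

  def store_in_memcached
    # nginx's memcached module can serve this body directly next time
    Rails.cache.write("pages#{request.path}", response.body)
  end
end
```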
object
We do basic object caching of ActiveRecord objects such as repositories and users all over the place
Caches are invalidated whenever the objects are saved
associations
We also cache associations as arrays of IDs
Grab the array, then do a get_multi on its contents to get a list of objects
That way we don’t have to worry about caching stale objects
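A sketch, assuming a memcached-backed Rails.cache (read_multi is get_multi under the hood); the association and key scheme are made up:

```ruby
# cache the association as ids only
ids = Rails.cache.fetch("repo:#{repo.id}:watcher_ids") do
  repo.watchers.map { |user| user.id }
end

# one round trip for the objects themselves; since each object lives
# under its own key and is refreshed on save, nothing here goes stale
users = Rails.cache.read_multi(*ids.map { |id| "users/#{id}" })
```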
walker
We also have a proprietary caching library called Walker
walker
It originally walked trees and cached them when someone pushed
But now it caches everything related to git:
walker
- commits
- diffs
- commit listing
- branches
- tags
- everything
Every git-related page load hits Walker a lot
walker
For most big apps, you need to write a caching layer that knows your business domain
Generic, catch-all caching libraries probably won’t do
events
An example of this is our events system
This is one fragment
Each of these is a fragment
They’re also cached as objects
As well as a list of ids
And that’s just for the dashboard...
optimizations
So what other optimizations have we done?
asset servers
Well we do the common trick of serving assets from multiple subdomains
asset servers
assets0.github.com
assets1.github.com
and so forth
sha asset id
Instead of using timestamps for asset ids, which may end up hitting the disk multiple times on each request, we set the asset id to be the sha of the last commit which modified a javascript or css file
sha asset id
/css/bundle.css?197d742e9fdec3f7
/js/bundle.js?197d742e9fdec3f7
Now simple code changes won’t force everyone to re-download the css or js bundles
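A sketch of the wiring in environment.rb (Rails honors ENV['RAILS_ASSET_ID'] over file mtimes; the asset paths are assumptions):

```ruby
# asset id = abbreviated sha of the last commit touching js/css
ENV['RAILS_ASSET_ID'] =
  `git log -n 1 --pretty=format:%h -- public/javascripts public/stylesheets`.strip
```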
bundling
For bundling itself, we use
bundling
yui’s compressor for css and
bundling
google’s closure compiler for javascript
we don’t use the most aggressive setting because it means changing your javascript to appease the compression gods, which we haven’t committed to yet
scripty 301
Again, for most of these tricks you need to really pay attention to your app.
One example is scriptaculous’ wiki
scripty 301
When we changed our wiki URL structure, we set up dynamic 301 redirects for the old urls.
Scriptaculous’ old wiki was getting hit so much we put the redirect into nginx itself - this took strain off our web app and made the redirects happen almost instantly
ajax loading
We also load data in via ajax in many places.
Sometimes a piece of information will just take too long to retrieve
In those instances, we usually load it in with ajax
If Walker sees that it doesn’t have all the information it needs, it kicks off a job to stick that information in memcached.
We then periodically hit a URL which checks if the information is in memcached or not. If it is, we get it and rewrite the page with the new information.
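Server-side, the pattern is roughly this sketch (controller and job names are made up):

```ruby
class GraphsController < ApplicationController
  def data
    if json = Rails.cache.read("graph:#{params[:id]}")
      render :json => json                     # ready; rewrite the page
    else
      Resque.enqueue(BuildGraph, params[:id])  # warm the cache
      head :accepted                           # javascript retries shortly
    end
  end
end
```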
We use this same trick on the Network Graph
Fork Queue
ajax loading
and anywhere else it makes sense.
comet loading
very soon this will all be comet, though
monitoring
what do we use for monitoring?
nagios
Our support team monitors the health of our machines and core services using nagios.
I don’t really touch the thing.
Here’s a screenshot from my IE browser, complete with the ICQ plugin
resque web
We monitor our queue using Resque’s included Sinatra app
haystack
We use an in-house app called Haystack to monitor arbitrary information, tracked as JSON.
Here’s an example of Haystack’s “exceptions” view
collectd
We also use collectd to monitor load, RAM usage, CPU usage, and other app-related metrics
pingdom
pingdom sends us SMSes when the site is down
it’s nice
tender
tender is what we use for customer support
it works incredibly well, and they’re constantly improving it
testing
Our testing setup is pretty standard
test unit
We mostly use Ruby’s test/unit.
We’ve experimented with other libraries including test/spec, shoulda, and RSpec, but in the end we keep coming back to test/unit
git fixtures
As many of our fixtures are git repositories, we specify in the test what sha we expect to be the HEAD of that fixture.
This means we can completely delete a git repository in one test, then have it back in pristine state in another. We plan to move all our fixtures to a similar git-based system in the future.
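A sketch of the idea (the path and sha are made up):

```ruby
require 'test/unit'

class RepositoryTest < Test::Unit::TestCase
  FIXTURE = "/tmp/fixtures/simple"
  HEAD    = "0123456789abcdef0123456789abcdef01234567"

  def setup
    # whatever the last test did to the fixture repo, put it back
    # in exactly the state this test expects
    Dir.chdir(FIXTURE) { `git reset --hard #{HEAD}` }
  end
end
```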
machinist
We use machinist for our fixtures
notahat / machinist
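Typical machinist usage looks like this (the blueprint fields are made up):

```ruby
require 'machinist/active_record'
require 'sham'
require 'faker'

Sham.login { Faker::Internet.user_name }

User.blueprint do
  login { Sham.login }
end

# in a test: a valid, saved User with unique attributes
user = User.make
```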
running_man
Gives us setup_once
Use it to cache machinist fixtures on a per-test-class basis
technoweenie / running_man
ci joe
We use ci joe, a continuous integration server, to run our tests after each push.
He then notifies us if the tests fail.
defunkt / cijoe
You can grab him at github
staging
We also always deploy the current branch to staging
This means you can be working on your branch, someone else can be working on theirs, and you don’t need to worry about reconciling the two to test out a feature
One of the best parts of Git
security
github.com/security
having a security page really helps
we get weekly emails to our security email (that people find on the security page)
and people are always grateful when we can reassure them or answer their questions
regular audits
if you can, find a security consultant to poke your site for XSS vulnerabilities
having your target audience be developers helps, too
24/7 monitoring
24/7 monitoring is cool too
backups
backups are incredibly important
don’t just make backups: ensure you can restore them, as well
sql
we keep nightly, off-site backups of our sql databases
git
and the same for all our git repositories
the future
svn
pull requests
organizations
...and more
Questions?
thanks for coming
Thanks.