Crowdsourcing with Django
![Page 1: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/1.jpg)
Crowdsourcing with Django
EuroPython, 30th June 2009
Simon Willison · http://simonwillison.net/ · @simonw
![Page 2: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/2.jpg)
“Web development on journalism deadlines”
![Page 3: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/3.jpg)
The back story...
![Page 4: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/4.jpg)
November 2000: The Freedom of Information Act
![Page 5: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/5.jpg)
• http://www.guardian.co.uk/politics/2009/may/08/mps-expenses-telegraph-checquebook-journalism
• http://www.guardian.co.uk/politics/2009/may/15/mps-expenses-heather-brooke-foi
Heather Brooke
![Page 6: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/6.jpg)
2004: The request
![Page 7: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/7.jpg)
January 2005: The FOI request
![Page 8: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/8.jpg)
July 2006: The FOI commissioner
![Page 9: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/9.jpg)
May 2007: The FOI (Amendment) Bill
![Page 10: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/10.jpg)
February 2008: The Information Tribunal
![Page 11: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/11.jpg)
“Transparency will damage democracy”
![Page 12: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/12.jpg)
May 2008: The high court
![Page 13: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/13.jpg)
January 2009: The exemption law
![Page 14: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/14.jpg)
![Page 15: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/15.jpg)
![Page 16: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/16.jpg)
March 2009: The mole
![Page 17: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/17.jpg)
“All of the receipts of 650-odd MPs, redacted and unredacted, are for sale at a price of £300,000, so I am told. The price is going up because of the interest in the subject.”
Sir Stuart Bell, MP
Newsnight, 30th March
![Page 18: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/18.jpg)
8th May, 2009: The Daily Telegraph
![Page 19: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/19.jpg)
At the Guardian...
![Page 20: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/20.jpg)
April: “Expenses are due out in a couple of months, is there anything we can do?”
![Page 21: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/21.jpg)
June: “Expenses have been bumped forward, they’re out next week!”
![Page 22: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/22.jpg)
Thursday 11th June: The proof-of-concept
![Page 23: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/23.jpg)
Monday 15th June: The tentative go-ahead
![Page 24: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/24.jpg)
Tuesday 16th June: Designer + client-side engineer
![Page 25: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/25.jpg)
Wednesday 17th June: Operations engineer
![Page 26: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/26.jpg)
Thursday 18th June: Launch day!
![Page 27: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/27.jpg)
![Page 28: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/28.jpg)
![Page 29: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/29.jpg)
![Page 30: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/30.jpg)
![Page 31: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/31.jpg)
![Page 32: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/32.jpg)
![Page 33: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/33.jpg)
How we built it
![Page 34: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/34.jpg)
![Page 35: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/35.jpg)
![Page 36: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/36.jpg)
$ convert Frank_Comm.pdf pages.png
![Page 37: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/37.jpg)
![Page 38: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/38.jpg)
Models
![Page 39: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/39.jpg)

    class Party(models.Model):
        name = models.CharField(max_length=100)

    class Constituency(models.Model):
        name = models.CharField(max_length=100)

    class MP(models.Model):
        name = models.CharField(max_length=100)
        party = models.ForeignKey(Party)
        constituency = models.ForeignKey(Constituency)
        guardian_url = models.CharField(max_length=255, blank=True)
        guardian_image_url = models.CharField(max_length=255, blank=True)
![Page 40: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/40.jpg)

    class FinancialYear(models.Model):
        name = models.CharField(max_length=10)

    class Document(models.Model):
        title = models.CharField(max_length=100, blank=True)
        filename = models.CharField(max_length=100)
        mp = models.ForeignKey(MP)
        financial_year = models.ForeignKey(FinancialYear)

    class Page(models.Model):
        document = models.ForeignKey(Document)
        page_number = models.IntegerField()
![Page 41: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/41.jpg)

    class User(models.Model):
        created = models.DateTimeField(auto_now_add = True)
        username = models.TextField(max_length = 100)
        password_hash = models.CharField(max_length = 128, blank=True)

    class LineItemCategory(models.Model):
        order = models.IntegerField(default = 0)
        name = models.CharField(max_length = 32)

    class LineItem(models.Model):
        user = models.ForeignKey(User)
        page = models.ForeignKey(Page)
        type = models.CharField(max_length = 16, choices = (
            ('claim', 'claim'),
            ('proof', 'proof'),
        ), db_index = True)
        date = models.DateField(null = True, blank = True)
        amount = models.DecimalField(max_digits=20, decimal_places=2)
        description = models.CharField(max_length = 255, blank = True)
        created = models.DateTimeField(auto_now_add = True, db_index = True)
        categories = models.ManyToManyField(LineItemCategory, blank=True)
![Page 42: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/42.jpg)

    class Vote(models.Model):
        user = models.ForeignKey(User, related_name = 'votes')
        page = models.ForeignKey(Page, related_name = 'votes')
        obsolete = models.BooleanField(default = False)
        vote_type = models.CharField(max_length = 32, blank = True)
        ip_address = models.CharField(max_length = 32)
        created = models.DateTimeField(auto_now_add = True)

    class TypeVote(Vote):
        type = models.CharField(max_length = 10, choices = (
            ('claim', 'Claim'),
            ('proof', 'Proof'),
            ('blank', 'Blank'),
            ('other', 'Other')
        ))

    class InterestingVote(Vote):
        status = models.CharField(max_length = 10, choices = (
            ('no', 'Not interesting'),
            ('yes', 'Interesting'),
            ('known', 'Interesting but known'),
            ('very', 'Investigate this!'),
        ))
![Page 43: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/43.jpg)
Frictionless registration
![Page 44: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/44.jpg)
![Page 45: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/45.jpg)
Page filters
![Page 46: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/46.jpg)
![Page 47: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/47.jpg)

    page_filters = (
        # Maps name of filter to dictionary of kwargs to doc.pages.filter()
        ('reviewed', { 'votes__isnull': False }),
        ('unreviewed', { 'votes__isnull': True }),
        ('with line items', { 'line_items__isnull': False }),
        ('interesting', { 'votes__interestingvote__status': 'yes' }),
        ('interesting but known', { 'votes__interestingvote__status': 'known'...
    )
    page_filters_lookup = dict(page_filters)
![Page 48: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/48.jpg)

    pages = doc.pages.all()
    if page_filter:
        kwargs = page_filters_lookup.get(page_filter)
        if kwargs is None:
            raise Http404('Invalid page filter: %s' % page_filter)
        pages = pages.filter(**kwargs).distinct()
    # Build the filters
    filters = []
    for name, kwargs in page_filters:
        filters.append({
            'name': name,
            'count': doc.pages.filter(**kwargs).distinct().count(),
        })
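
The filter pattern on these two slides (an ordered tuple of name/kwargs pairs, a dict built from it for lookup, and a count per filter) can be tried without Django. This is a minimal sketch under stated assumptions: pages are plain dicts, the hypothetical `matches` helper stands in for `doc.pages.filter(**kwargs)`, and `PAGES`, `filter_pages` and `filter_counts` are illustrative names that do not appear in the talk.

```python
# Ordered (name, kwargs) pairs, mirroring page_filters on the slide.
# The keys are plain dict fields here, not Django lookups (an assumption).
PAGE_FILTERS = (
    ("reviewed", {"reviewed": True}),
    ("unreviewed", {"reviewed": False}),
    ("interesting", {"status": "yes"}),
)
PAGE_FILTERS_LOOKUP = dict(PAGE_FILTERS)

# Hypothetical sample data standing in for doc.pages.
PAGES = [
    {"number": 1, "reviewed": True, "status": "yes"},
    {"number": 2, "reviewed": True, "status": "no"},
    {"number": 3, "reviewed": False, "status": None},
]


def matches(page, **kwargs):
    """Stand-in for QuerySet.filter(**kwargs): exact match on every field."""
    return all(page.get(field) == value for field, value in kwargs.items())


def filter_pages(pages, filter_name=None):
    """Return the pages selected by filter_name, or all pages if it is empty."""
    if not filter_name:
        return list(pages)
    kwargs = PAGE_FILTERS_LOOKUP.get(filter_name)
    if kwargs is None:
        # The view on the slide raises Http404 at this point.
        raise LookupError("Invalid page filter: %s" % filter_name)
    return [page for page in pages if matches(page, **kwargs)]


def filter_counts(pages):
    """Build the name/count entries used to render the list of filters."""
    return [
        {"name": name, "count": len(filter_pages(pages, name))}
        for name, _ in PAGE_FILTERS
    ]
```

Keeping the filters in a tuple preserves their display order, while the derived dict gives constant-time validation of the name taken from the URL.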
![Page 49: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/49.jpg)
Matching names
![Page 50: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/50.jpg)
http://github.com/simonw/datamatcher
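
The slides do not show how datamatcher works internally, so the following is not its method. It is a generic sketch of one way to pair a scraped name with a known MP name, using the standard library's `difflib`; `normalise`, `match_name` and the `cutoff` default are illustrative assumptions.

```python
import difflib


def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def match_name(candidate, known_names, cutoff=0.6):
    """Return the closest known name to candidate, or None if nothing is close."""
    by_normalised = {normalise(name): name for name in known_names}
    close = difflib.get_close_matches(
        normalise(candidate), list(by_normalised), n=1, cutoff=cutoff
    )
    return by_normalised[close[0]] if close else None
```

Anything that returns `None`, or that matches with a low score, would still need a human to confirm it.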
![Page 51: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/51.jpg)
On the day
![Page 52: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/52.jpg)
![Page 53: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/53.jpg)
![Page 54: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/54.jpg)
![Page 55: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/55.jpg)

    def get_mp_pages():
        "Returns list of (mp-name, mp-page-url) tuples"
        soup = Soup(urllib.urlopen(INDEX_URL))
        mp_links = []
        for link in soup.findAll('a'):
            if link.get('title', '').endswith("'s allowances"):
                mp_links.append(
                    (link['title'].replace("'s allowances", ''), link['href'])
                )
        return mp_links
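
The link-scraping step in `get_mp_pages` relies on BeautifulSoup and a live index page. As a rough, self-contained illustration of the same idea, the sketch below uses only the standard library's `html.parser` and works on an HTML string; `AllowanceLinkParser` and `get_mp_pages_from_html` are hypothetical names, and the real page markup may differ from what is assumed here.

```python
from html.parser import HTMLParser

SUFFIX = "'s allowances"


class AllowanceLinkParser(HTMLParser):
    """Collect (mp-name, href) pairs from <a> tags whose title ends with SUFFIX."""

    def __init__(self):
        super().__init__()
        self.mp_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        title = attributes.get("title") or ""
        href = attributes.get("href")
        if title.endswith(SUFFIX) and href:
            # Drop the suffix to leave just the MP's name.
            self.mp_links.append((title[: -len(SUFFIX)], href))


def get_mp_pages_from_html(html):
    """Return a list of (mp-name, mp-page-url) tuples found in an HTML string."""
    parser = AllowanceLinkParser()
    parser.feed(html)
    return parser.mp_links
```

Working on a string rather than a URL makes the extraction logic easy to test against a saved copy of the page before the real data is published.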
![Page 56: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/56.jpg)

    def get_pdfs(mp_url):
        "Returns list of (description, years, pdf-url, size) tuples"
        soup = Soup(urllib.urlopen(mp_url))
        pdfs = []
        trs = soup.findAll('tr')[1:] # Skip the first, it's the table header
        for tr in trs:
            name_td, year_td, pdf_td = tr.findAll('td')
            name = name_td.string
            year = year_td.string
            pdf_url = pdf_td.find('a')['href']
            size = pdf_td.find('a').contents[-1].replace('(', '').replace(')', '')
            pdfs.append(
                (name, year, pdf_url, size)
            )
        return pdfs
![Page 57: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/57.jpg)
![Page 58: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/58.jpg)
![Page 59: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/59.jpg)
![Page 60: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/60.jpg)
“Drop Everything”
![Page 61: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/61.jpg)
Photoshop + AppleScript
vs.
Java + IntelliJ
![Page 62: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/62.jpg)
Images on our docroot (S3 upload was taking too long)
![Page 63: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/63.jpg)
Blitz QA
![Page 64: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/64.jpg)
Launch! (on EC2)
![Page 65: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/65.jpg)
![Page 66: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/66.jpg)
Crash #1: more Apache children than MySQL connections
![Page 67: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/67.jpg)
![Page 68: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/68.jpg)
![Page 69: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/69.jpg)

    unreviewed_count = Page.objects.filter(
        votes__isnull = True
    ).distinct().count()
![Page 70: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/70.jpg)

    SELECT COUNT(DISTINCT `expenses_page`.`id`)
    FROM `expenses_page`
    LEFT OUTER JOIN `expenses_vote`
        ON ( `expenses_page`.`id` = `expenses_vote`.`page_id` )
    WHERE `expenses_vote`.`id` IS NULL
![Page 71: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/71.jpg)

    unreviewed_count = cache.get('homepage:unreviewed_count')
    if unreviewed_count is None:
        unreviewed_count = Page.objects.filter(
            votes__isnull = True
        ).distinct().count()
        # Same key as the cache.get() above, cached for 60 seconds
        cache.set('homepage:unreviewed_count', unreviewed_count, 60)
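
The get-or-compute caching pattern used for the homepage count can be exercised outside Django. In this minimal sketch, `SimpleCache` is an in-process stand-in for memcached and `cached_count` is an illustrative helper; neither name comes from the talk, and the 60-second default simply mirrors the timeout on the slide.

```python
import time


class SimpleCache:
    """Tiny in-process stand-in for memcached (illustrative only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (value, time.monotonic() + timeout)


def cached_count(cache, key, compute, timeout=60):
    """Return the cached value for key, computing and storing it on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value
```

Testing for `None` rather than truthiness matters here: a legitimate count of 0 should be served from the cache instead of triggering the expensive query again.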
![Page 72: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/72.jpg)
• With 70,000 pages and a LOT of votes...
• DB takes up 135% of CPU
• Cache the count in memcached...
• DB drops to 35% of CPU
![Page 73: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/73.jpg)

    unreviewed_count = Page.objects.filter(
        votes__isnull = True
    ).distinct().count()

    reviewed_count = Page.objects.filter(
        votes__isnull = False
    ).distinct().count()
![Page 74: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/74.jpg)

    unreviewed_count = Page.objects.filter(
        is_reviewed = False
    ).count()
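
Replacing the join with an `is_reviewed` column is a denormalisation: the flag duplicates information already implied by the votes, so something has to keep it in step. The slides do not show how that was done. The sketch below assumes the flag is set at the moment a vote is recorded, and uses a plain Python class rather than a Django model; `add_vote` and `unreviewed_count` are illustrative names.

```python
class Page:
    """Plain-Python stand-in for the Page model (not Django)."""

    def __init__(self, number):
        self.number = number
        self.votes = []
        # Denormalised copy of "this page has at least one vote".
        self.is_reviewed = False

    def add_vote(self, vote):
        """Record a vote and update the flag in the same step."""
        self.votes.append(vote)
        self.is_reviewed = True


def unreviewed_count(pages):
    """Count on the flag alone, without inspecting each page's votes."""
    return sum(1 for page in pages if not page.is_reviewed)
```

The trade-off is a slightly more expensive write in exchange for a count that no longer needs a join or a `DISTINCT`.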
![Page 75: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/75.jpg)
Migrating to InnoDB on a separate server
![Page 76: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/76.jpg)

    ssh mps-live "mysqldump mp_expenses" |
      sed 's/ENGINE=MyISAM/ENGINE=InnoDB/g' |
      sed 's/CHARSET=latin1/CHARSET=utf8/g' |
      ssh mysql-big "mysql -u root mp_expenses"
![Page 77: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/77.jpg)
“next” button
![Page 78: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/78.jpg)

    def next_global(request):
        # Next unreviewed page from the whole site
        all_unreviewed_pages = Page.objects.filter(
            is_reviewed = False
        ).order_by('?')
        if all_unreviewed_pages:
            return Redirect(
                all_unreviewed_pages[0].get_absolute_url()
            )
        else:
            return HttpResponse(
                'All pages have been reviewed!'
            )
![Page 79: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/79.jpg)

    import random

    def next_global_from_cache(request):
        page_ids = cache.get('unreviewed_page_ids')
        if page_ids:
            return Redirect(
                '/page/%s/' % random.choice(page_ids)
            )
        else:
            return next_global(request)
![Page 80: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/80.jpg)

    from django.core.management.base import BaseCommand
    from mp_expenses.expenses.models import Page
    from django.core.cache import cache

    class Command(BaseCommand):
        help = """
        populate unreviewed_page_ids in memcached
        """
        requires_model_validation = True
        can_import_settings = True

        def handle(self, *args, **options):
            ids = list(Page.objects.exclude(
                is_reviewed = True
            ).values_list('pk', flat=True)[:1000])
            cache.set('unreviewed_page_ids', ids)
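
Taken together, the last three slides describe a two-part scheme: a periodic job stores up to 1,000 unreviewed page IDs under one cache key, and the "next" view picks one at random, falling back to the database query when the key is empty. A minimal sketch of that flow, assuming a plain dict in place of memcached and (id, is_reviewed) tuples in place of Page rows; `refresh_unreviewed_ids` and `next_page_url` are illustrative names.

```python
import random


def refresh_unreviewed_ids(cache, pages, limit=1000):
    """Periodic job: store up to `limit` unreviewed page ids under one key."""
    ids = [page_id for page_id, is_reviewed in pages if not is_reviewed][:limit]
    cache["unreviewed_page_ids"] = ids
    return ids


def next_page_url(cache, fallback):
    """Pick a random cached id, or defer to fallback() when the cache is empty."""
    page_ids = cache.get("unreviewed_page_ids")
    if page_ids:
        return "/page/%s/" % random.choice(page_ids)
    return fallback()
```

Choosing in Python from a cached list avoids running `ORDER BY RAND()` on every click, at the cost of occasionally sending someone to a page that was reviewed after the list was last refreshed.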
![Page 81: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/81.jpg)
The numbers
![Page 82: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/82.jpg)
![Page 83: Crowdsourcing with Django](https://reader033.fdocuments.us/reader033/viewer/2022052823/5552c076b4c90581158b4715/html5/thumbnails/83.jpg)
Final thoughts
• High score tables help
• MP photographs really help
• Keeping up the interest is hard
• Next step: start releasing the data