Data collection some techniques -...

Genoveva Vargas-Solar CR1, CNRS, LIG-LAFMIA [email protected] http://vargas-solar.com , Montevideo, 21 st November, 2014 INFORMATIQUE Data collection some techniques HADAS GROUP 1 Vijay Upadhyay http://slides.com/myasoobkhalid/web-scraping#/32


Page 1: Data collection some techniques - vargas-solar.comvargas-solar.com/bigdata-fest/wp-content/uploads/sites/33/2014/11/... · Data collection some techniques HADAS GROUP 1 ... Scrapy



Web scraping

http://slides.com/myasoobkhalid/web-scraping#/32


+ Web scraper!

n  Any program that retrieves structured data from the web, and then transforms it to conform with a different structure

n  Wait, isn’t that just ETL? (extract, transform, load)

n  Well, sort of, but I don’t want to call it that...


+ Web scraper!

n  “Scraping” applies to web pages, but also to getting data from a CSV or JSON file

n  Why not ETL?
    n  ETL implies that there are rules and expectations
    n  These two things don’t exist in the world of the Web
    n  Data publishers can change the structure of their dataset without telling you, or even take the dataset down on a whim.

n  A program that pulls down data is often going to be a bit hacky by necessity, so “scraper” seems like a good term for that


+ Web scraping!

n  Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites

n  Usually, such software programs simulate human exploration of the World Wide Web by either
    n  Implementing low-level Hypertext Transfer Protocol (HTTP)
    n  Embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox

Wikipedia

n  A method to extract data from a website that does not have an API, or from which we want to extract a LOT of data that we cannot get through an API due to rate limiting

n  Through web scraping we can extract any data which we can see while browsing the web.


+ What for?!

n  Extract product information

n  Extract job postings and internships

n  Extract offers and discounts from deal-of-the-day websites

n  Crawl forums and social websites

n  Extract data to make a search engine

n  Gathering weather data


+ Web scraping vs. API!

n  Web scraping is not bound by an API's rate limits

n  Anonymously access the website and gather data

n  Some websites do not have an API

n  Some data is not accessible through an API


+ Web scraping workflow!

n  Get the website - using HTTP library

n  Parse the html document - using any parsing library

n  Store the results - either a db, csv, text file, etc
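
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the deck's own code: the names `fetch`, `LinkParser`, `parse_links` and `store_csv` are hypothetical, and the parser only collects links, where a real scraper would extract whatever fields it needs.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: get the page over HTTP (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Step 2: collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html_doc):
    parser = LinkParser()
    parser.feed(html_doc)
    return parser.links


def store_csv(rows, out):
    """Step 3: write the rows to any file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["href", "text"])
    writer.writerows(rows)
```

A typical run would be `store_csv(parse_links(fetch(url)), open("links.csv", "w", newline=""))`; swapping the parser for BeautifulSoup or lxml leaves the other two steps unchanged.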


+ Libraries for parsing!

n  Some of the most widely known libraries used for web scraping are:
    n  BeautifulSoup
    n  lxml
    n  re
    n  Scrapy (a complete framework)


+ Parsing libraries!

n  BeautifulSoup
    n  tree = BeautifulSoup(html_doc, "html.parser")
    n  tree.title

n  lxml
    n  tree = lxml.html.fromstring(html_doc)
    n  title = tree.xpath('//title/text()')

n  re
    n  title = re.findall('<title>(.*?)</title>', html_doc)


+ BeautifulSoup!

n  A beautiful API
    n  soup = BeautifulSoup(html_doc)
    n  last_a_tag = soup.find("a", id="link3")
    n  all_b_tags = soup.find_all("b")

n  very easy to use

n  can handle broken markup

n  purely in Python

n  slow :(


+ lxml!

The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed

n  very fast
n  not purely in Python
n  If you have no "pure Python" requirement, use lxml
n  lxml works with all Python versions from 2.4 to 3.3


+ re!

n  re is the regex library for Python. It is best used only to extract small amounts of text

n  Parsing an entire HTML document is not possible with regular expressions

n  However, it is
    n  baked into Python
    n  part of the standard library
    n  very fast - I will show later
    n  supported by every Python version
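
As a small illustration of the "minute amount of text" use case, the sketch below pulls only the page title with re. The function name `extract_title` is hypothetical, and the pattern assumes a single, well-formed title element; anything more structural is better left to a real parser.

```python
import re

# Tolerates attributes on the tag, mixed case, and line breaks inside the title.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_doc):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html_doc)
    if match is None:
        return None
    # Collapse runs of whitespace into single spaces.
    return " ".join(match.group(1).split())
```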


+ Steps to writing a scraper!

n  Find the data source

n  Find the metadata

n  Analysis (verify the primary key): should also include noting which fields should be lookup fields

n  Develop

n  Test: always done on real data, in three phases:
    n  dry run (nothing added or updated)
    n  dry run with lookups (only lookups are added)
    n  production run
n  Run all three phases on a local instance before deploying to production

n  Fix (repeat ∞ times)


+ Storing scraped data!

n  Do not create tables before you understand how you want to use the data

n  Consider using a non-relational DB

n  See Adrian Holovaty’s talk on how EveryBlock avoided creating new tables for each dataset
    n  http://bit.ly/Yl6VAZ (relevant part starts at 7:10)


Components of a scraping system!

n  Downloader

n  Cacher
    n  Caching is essential when scraping a dataset that involves a large number of HTML pages
    n  Test runs can take hours if you’re making requests over the network
    n  A good caching system pretty prints the files it downloads so you can more easily inspect them

n  Raw item retriever

n  Existing item detector

n  Item transformer

n  Status reporter
    n  Reporting is essential if you’re managing a group of scrapers
    n  Since you KNOW that at least one of your scrapers will be broken at any time, you might as well know which ones are broken
    n  A good reporting mechanism shows when your scrapers break, as well as when the dataset itself has issues (determined heuristically)


+ Scraping at scale!

n  You want to scrape millions of web pages every day

n  You want to build a broad-scale web scraper

n  You want to use something that is thoroughly tested

n  Is there a solution?


+ Scrapy (http://scrapy.org)!

n  Application framework for writing web spiders that crawl web sites and extract data from them
    n  Scrapy only supports Python 2.7 and NOT 3.x (as of this talk, November 2014)
    n  It's a tested framework
    n  It's asynchronous
    n  It's easy to use
    n  It has everything you need to start scraping


Types of scrapers according to sources!

Some tools


Main types of scrapers!

n  CSV

n  RSS/Atom

n  JSON

n  XML

n  HTML crawler

n  Web browser

n  PDF

n  Database dump

n  GIS

n  Mixed


+ CSV!

n  import csv

n  You should usually use csv.DictReader  

n  If the column names are all caps, consider making them lowercase.

n  Watch out for CSV datasets that do not have the same number of elements on each row
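
One way to follow these tips with csv.DictReader is sketched below. The function name `read_rows` and the `"_extra"` key are hypothetical choices; the sketch lowercases the header once and flags rows that have too few or too many elements rather than silently accepting them.

```python
import csv
import io


def read_rows(csv_text):
    """Yield (row_dict, is_ragged) pairs with lowercased column names."""
    reader = csv.DictReader(io.StringIO(csv_text), restkey="_extra", restval=None)
    # Reading fieldnames consumes the header row; store it back lowercased.
    reader.fieldnames = [name.lower() for name in reader.fieldnames]
    for row in reader:
        # Short rows are padded with None; long rows get an "_extra" list.
        ragged = "_extra" in row or any(v is None for v in row.values())
        yield row, ragged
```

Whether a ragged row should be skipped, repaired, or reported is a per-dataset decision, which is why the flag is returned instead of acted on.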


+ CSV!

import csv

def get_rows(csv_file):
    # newline="" lets the csv module handle line endings itself.
    with open(csv_file, newline="") as f:
        reader = csv.reader(f)
        # Get the column names, lowercased.
        column_names = tuple(k.lower() for k in next(reader))
        for row in reader:
            yield dict(zip(column_names, row))


+ XML!

n  import lxml.etree  

n  Get rid of namespaces in the input document. http://bit.ly/LO5x7H  

n  A lot of XML datasets have a fairly flat structure

n  In these cases, convert the elements to dictionaries


+ XML!

<root>
  <items>
    <item>
      <id>3930277-ac</id>
      <name>Frodo Samwise</name>
      <age>56</age>
      <occupation>Tolkien scholar</occupation>
      <description>Short, with hairy feet</description>
    </item>
    ...
  </items>
</root>

import lxml.etree

def get_items(xml_string):
    tree = lxml.etree.fromstring(xml_string)
    for el in tree.findall('items/item'):
        children = list(el)  # replaces the deprecated el.getchildren()
        # Keys are element names.
        keys = (c.tag for c in children)
        # Values are element text contents.
        values = (c.text for c in children)
        yield dict(zip(keys, values))


+ HTML!

n  import requests

n  import  lxml.html  

n  Use XPath, but pyquery seems fine too

n  If the HTML is very funky, use html5lib as the parser

n  Sometimes data can be scraped from a chunk of JavaScript embedded in the page


+ HTML!

n  Firefinder (http://bit.ly/kr0UOY) Extension for Firebug

n  Allows you to test CSS and XPath expressions on any page, and visually inspect the results.


+ HTTP libraries!

n  Requests
    n  html = requests.get('https://www.google.com').text

n  urllib and urllib2 (Python 2; merged into urllib.request in Python 3)
    n  html = urllib2.urlopen('http://python.org/').read()

n  httplib and httplib2
    n  h = httplib2.Http(".cache")
    n  (resp_headers, content) = h.request("http://example.org/", "GET")


+ PDF!

n  There are no Python libraries that handle all kinds of PDF documents in the wild

n  Use the pdftohtml command to convert the PDF to XML

n  When debugging, use pdftohtml to generate HTML that you can inspect in the browser

n  If the text in the PDF is in tabular format, you can group text cells by proximity


+ PDF!

The “group by proximity” strategy works like this:
    n  1. Find a text cell that has a very distinct pattern (probably a date cell). This is your “anchor”
    n  2. Find all cells that have the same row position as the anchor (possibly off by a few pixels)
    n  3. Figure out which grouped cells belong to which fields based on column position
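
The three steps can be sketched as follows, assuming the PDF has already been converted into text cells with pixel coordinates. Everything here is illustrative: the `(x, y, text)` cell layout, the date pattern used as anchor, the column boundaries and the name `group_rows` are assumptions, not output of any particular tool.

```python
import re

# Assumed anchor pattern: a cell holding a date such as 21/11/2014.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")


def group_rows(cells, columns, tolerance=2):
    """Group text cells into records.

    cells:   list of (x, y, text) tuples
    columns: list of (field_name, x_min, x_max) tuples
    """
    rows = []
    # Step 1: the anchors are the cells matching the distinct pattern.
    anchors = [c for c in cells if DATE_RE.match(c[2])]
    for _, anchor_y, _ in sorted(anchors, key=lambda c: c[1]):
        # Step 2: same row position as the anchor, give or take a few pixels.
        same_row = [c for c in cells if abs(c[1] - anchor_y) <= tolerance]
        # Step 3: assign each cell to a field by its column position.
        record = {}
        for x, _, text in same_row:
            for field, x_min, x_max in columns:
                if x_min <= x < x_max:
                    record[field] = text
        rows.append(record)
    return rows
```

Header cells and page furniture drop out naturally because no anchor shares their row position.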


+ RSS/Atom!

n  import feedparser  

n  Sometimes feedparser can’t handle custom fields, and you’ll have to fall back to lxml.etree

n  Unfortunately, plenty of RSS feeds are not compliant XML
    n  Either do some custom munging or try html5lib


+ youtube-dl (http://rg3.github.io/youtube-dl/)!

n  Python script that allows you to download videos and music from various websites such as:
    n  Facebook
    n  YouTube
    n  Vimeo
    n  Dailymotion
    n  Metacafe
    n  and almost 300 more!


+ Design patterns!

n  If a field contains a finite number of possible values, use a lookup table instead of storing each value

n  Make a scraper superclass that incorporates common scraper logic

n  The scraper superclass will probably have convenience methods for converting dates/times, cleaning HTML, looking for existing items, etc. It should also incorporate the caching and reporting logic
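
A scraper superclass along these lines might look like the sketch below. All names (`BaseScraper`, `lookup_id`, `raw_items`, `transform`) are hypothetical, the lookup tables and the set of seen ids live in memory where a real system would use its database, and the caching logic is omitted for brevity.

```python
from datetime import datetime


class BaseScraper:
    """Hypothetical superclass holding logic shared by all scrapers."""

    date_formats = ("%Y-%m-%d", "%d/%m/%Y")

    def __init__(self):
        self.lookups = {}      # field name -> {value: small integer id}
        self.seen_ids = set()  # primary keys of items already stored
        self.report = {"new": 0, "skipped": 0}

    def parse_date(self, text):
        """Convenience method: try each known date format in turn."""
        for fmt in self.date_formats:
            try:
                return datetime.strptime(text.strip(), fmt).date()
            except ValueError:
                continue
        raise ValueError("unrecognised date: %r" % text)

    def lookup_id(self, field, value):
        """Lookup table: store each distinct value once, return its id."""
        table = self.lookups.setdefault(field, {})
        return table.setdefault(value, len(table) + 1)

    def raw_items(self):
        raise NotImplementedError  # each concrete scraper supplies its own

    def transform(self, raw):
        return raw

    def run(self):
        items = []
        for raw in self.raw_items():
            if raw["id"] in self.seen_ids:  # existing item detector
                self.report["skipped"] += 1
                continue
            self.seen_ids.add(raw["id"])
            self.report["new"] += 1
            items.append(self.transform(raw))
        return items
```

A concrete scraper then only overrides `raw_items` and `transform`, and the `report` counts feed the status reporter.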


Web crawling


+ Motivation

A key motivation for designing Web crawlers has been to retrieve Web pages and add their representations to a local repository


+ Web Crawling

n  A Web crawler (also known as a Web spider, Web robot, or, especially in the FOAF community, Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner

n  Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.


+ Crawlers

n  Computer programs that roam the Web with the goal of automating specific tasks related to the Web

n  The role of Crawlers is to collect Web Content


+ Basic crawler operation

n  Begin with known “seed” pages

n  Fetch and parse them

n  Extract URLs they point to

n  Place the extracted URLs on a Queue

n  Fetch each URL on the queue and repeat


+ Traditional Web Crawler

[Figure: flow diagram from HT'06 with the components Init, Seed URLs, Frontier, Download resource, Extract URLs, Visited URLs, Web, and Repo]


+ Web crawler: basic algorithm

{
    Pick up the next URL
    Connect to the server
    GET the URL
    When the page arrives, get its links
    (optionally do other stuff)
    REPEAT
}
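
The loop above can be turned into a small breadth-first crawler. This is a sketch under stated assumptions rather than a production design: `crawl` and `_LinkExtractor` are hypothetical names, the download step is passed in as `fetch` (returning HTML text, or None on failure), and politeness rules such as robots.txt and request delays are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)  # URLs waiting to be fetched
    visited = set()          # URLs already picked up
    repo = {}                # url -> page content (the local repository)
    while frontier and len(repo) < max_pages:
        url = frontier.popleft()          # pick up the next URL
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                 # connect and GET
        if page is None:
            continue
        repo[url] = page
        extractor = _LinkExtractor()
        extractor.feed(page)              # get the page's links
        for href in extractor.hrefs:
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            if link not in visited:
                frontier.append(link)
    return repo
```

Using a FIFO queue for the frontier gives breadth-first order; a priority queue in its place is one way to move towards a focused crawler.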


+ Uses

n  Complete web search engine
    n  Search Engine = Crawler + Indexer/Searcher (Lucene) + GUI
n  Find stuff
n  Gather stuff
n  Check stuff


+ Types of Crawlers

•  Batch : Crawl a snapshot of their crawl space, until reaching a certain size or time limit

•  Incremental : Continuously crawl their crawl space, revisiting URLs to ensure freshness

•  Focused: Attempt to crawl pages pertaining to some topic/theme, while minimizing number of off topic pages that are collected


+ URL normalization!

n  Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once.

n  The term URL normalization refers to the process of modifying and standardizing a URL in a consistent manner
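
A few common normalization steps can be sketched with urllib.parse. The function name `normalize_url` and the choice of rules are assumptions made for illustration: lowercase the scheme and host, drop default ports, resolve "." and ".." path segments, and discard the fragment. The sketch ignores user info and IPv6 hosts and leaves the query string untouched.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of url so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = "%s:%d" % (host, parts.port)
    # Resolve "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath strips a trailing slash, which is significant in URLs.
    if parts.path.endswith("/") and path != "/":
        path += "/"
    # The fragment never reaches the server, so drop it.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

A crawler would apply this to every extracted link before checking it against the set of visited URLs.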


+ The challenges of « Web Crawling »

Three characteristics of the Web that make crawling it very difficult:

n  Its large volume

n  Its fast rate of change

n  Dynamic page generation


+ Examples of Web crawlers!

•  RBSE

•  World Wide Web Worm

•  Google Crawler

•  WebFountain

•  WebRACE


+ Web 3.0 Crawling

Web 3.0 defines advanced technologies and new principles for next-generation search technologies, summarized in

- Semantic Web

- Website Parse Template concepts

Web 3.0 crawling and indexing technologies will be based on

- Human-machine clever associations


© 2005 Denise M. Gosnell. All Rights Reserved.

How are Web APIs used?

n  Series or collection of web services

n  Sometimes used interchangeably with “web services”

n  Examples: Google API, Amazon.com APIs


+ How Do You Call a Web API?

XML web services can be invoked in one of three ways:

n  Using REST (HTTP-GET)
    n  URL includes parameters
    n  Example: “http://search.twitter.com/search.atom?q=”

n  Using HTTP-POST
    n  You post an XML document
    n  XML document returned

n  Using SOAP
    n  More complex, allows structured and type information
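
For the REST (HTTP-GET) case, the parameters simply become a URL-encoded query string. The sketch below shows that step with the standard library; `build_get_url` and `call_api` are hypothetical helper names, and the Twitter endpoint from the slide is used only as a URL shape and may no longer respond.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_get_url(base_url, **params):
    """Return base_url with params appended as a URL-encoded query string."""
    return base_url + "?" + urlencode(params)


def call_api(base_url, **params):
    """Send the GET request and return the raw response body (needs network)."""
    with urlopen(build_get_url(base_url, **params), timeout=10) as response:
        return response.read()
```

For example, `build_get_url("http://search.twitter.com/search.atom", q="web scraping")` encodes the space in the query so the URL is safe to send.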


+ APIs that deliver information

[Figure: diagram linking an App, a Web API, and Web Crawling and Indexing. Labels: Keywords (Recession, slump); Structured Queries (Recession, 22Nov’08, NY); XML Documents (Recession, slump)]


+ References

•  http://en.wikipedia.org/wiki/Web_crawling
•  www.cs.cmu.edu/~spandey
•  www.cs.odu.edu/~fmccown/research/lazy/crawling-policies-ht06.ppt
•  http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
•  www.grub.org
•  www.filesland.com/companies/Shettysoft-com/web-crawler.html
•  www.ciw.cl/recursos/webCrawling.pdf
•  www.openldap.org/conf/odd-wien-2003/peter.pdf


Crowdsourcing

Fansourcing Crowdcasting Open Sourcing

Open Innovation Mass Collaboration Collective Customer Commitment

Wikinomics Collective Intelligence


Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call

"Crowdsourcing": the term was coined by Jeff Howe in Wired magazine in 2006


+ Wisdom of the Crowds!

n  The crowd at a county fair accurately guessed the weight of an ox when their individual guesses were averaged

n  Average
    n  Closer to the ox's true butchered weight than the estimates of most crowd members, and also
    n  Closer than any of the separate estimates made by cattle experts


+ Wikinomics 101: Wisdom of the Crowds

[Figure: Enterprise Web 2.0 chart of utility against number of contributors (10, 100, 1000, 10,000+), running from Expert ($$$$) to Masses ($). Caption: equivalent, or greater, utility under the curve]


+ Economics & Wikinomics

[Figure: Enterprise Web 2.0 chart of utility against number of contributors (10, 100, 1000, 10,000+), running from Expert ($$$$) to Masses ($)]

n  Expert: 4,000 experts, 80,000 articles, 200 years to develop, annual updates, “8.8/10.0 Reliability”

n  Masses: 100,000 amateurs, 1.6 million articles, 5 years to develop, real-time updates, “8.0/10.0 Reliability”


+ What is crowdsourcing?!

n  Crowdsourcing is an online, distributed problem-solving and production model
    n  Users, also known as the crowd, typically form into online communities based on the Web site, and the crowd submits solutions to the site or produces its content
    n  The crowd can also sort through the solutions, finding the best ones
    n  These best solutions are then owned by the entity that broadcast the problem in the first place: the crowdsourcer

n  The winning individuals in the crowd are sometimes rewarded

n  Many individuals in the crowd participate just for intellectual stimulation or because of emotional ties to product or service


+ Benefits of Crowdsourcing to Companies!

n  Problems can be explored at comparatively little cost

n  Payment is by results

n  The organization can tap a wider range of talent than might be present in its own organization

n  Turn customers into designers

n  Turn customers into marketers


Amazon Mechanical Turk



Crowdsourcing: Rent-A-Coder



Crowdsourcing: the benefits!

n  Companies get
    n  Improved quality and productivity
    n  Feedback
    n  Good exposure
    n  Minimum of cost

n  People get
    n  Incentive
        n  Cash Cash Cash
    n  Recognition
        n  Sense of accomplishment among peers
    n  Make Life Better
        n  Linux
        n  Obama Campaign


+ Problems with Crowdsourcing!

n  Quality

n  Intellectual property leakage

n  No time constraint

n  Not much control over development or ultimate product

n  Ill-will with own employees

n  Choosing what to crowdsource & what to keep in-house


+ Type of problems to outsource!

n  No internal expertise

n  Non-essential and non-critical

n  One that has no time constraint

n  One that benefits from crowd involvement

n  One-time problems


Some Applications of Crowdsourcing!

•  Testing & Refining a Product
    •  Netflix
    •  SellaBand

•  Market Research
    •  Threadless

•  Knowledge Management
    •  Accenture
    •  Wikipedia

•  Customer Service
    •  My Starbucks ideas

•  R & D
    •  InnoCentive
    •  P&G Connect & Develop

•  Polling and Voting
    •  InTrade
    •  Building a new city


+ Elements for a Wise Crowd!!

•  Diversity of opinion: Each person should have private information even if it's just an eccentric interpretation of the known facts

•  Independence: People's opinions aren't determined by the opinions of those around them

•  Decentralization: People are able to specialize and draw on local knowledge

•  Aggregation: Some mechanism exists for turning private judgments into a collective decision


+ Reasons to fear Crowd Intelligence!

•  Too homogeneous: A crowd needs diversity to ensure enough variance in approach, thought process, and private information.

•  Too centralized: In the Columbia shuttle disaster, the hierarchical NASA management bureaucracy was totally closed to the wisdom of low-level engineers

•  Too divided: The US Intelligence community failed to prevent the September 11 attacks partly because information held by one subdivision was not accessible by another. Crowds work best when they choose for themselves what to work on and what information they need

•  Too imitative: Where choices are visible and made in sequence, an information cascade can form in which only the first few decision makers gain anything by contemplating the choices available

•  Too emotional: Emotional factors, such as a feeling of belonging, can lead to peer pressure and herd mentality


+ Conclusion:!

n  Crowdsourcing used properly
    n  Generates new ideas
    n  Cuts development costs
    n  Creates a direct, emotional bond with customers

n  Used improperly
    n  Can produce useless, wasteful results
    n  Beware of mob rule

“Crowds can be wise, but they can also be stupid.”


+

https://crowd4u.org!


+ Want More Information?!

n  About Crowdsourcing
    n  Jeff Howe's blog: www.crowdsourcing.com
    n  The Rise of Crowdsourcing: www.wired.com/wired/archive/14.06/crowds.html

n  Paid Crowdsourcing: Current State and Progress towards Mainstream Business Use
    n  http://www.marketwire.com/press-release/SmartsheetCom-1045951.html


+Bibliography!

n  Alsever, Jennifer, “What is Crowdsourcing?” www.bnet.com Mar 7th, 2007 Reliability = Good, Article summarized a lot of need to know information about Crowdsourcing as it was just becoming a topic for business.

n  Howe, Jeff Crowdsourcing Definition http://www.crowdsourcing.com Checked Apr 18th 2009 Reliability = Blog site of Jeff Howe, who coined the term Crowdsourcing. Site contains links and thoughts on articles in the news and feedback from speaking events.

n  Howe, Jeff “The Rise of Crowdsourcing” www.wired.com 06-Sep Reliability = Great, The original article where the term “Crowdsourcing” was born; it talks about a few companies that are using it.

n  Frei, Brent “Paid Crowdsourcing: Current State & Progress toward Mainstream Business Use” www.marketwire.com 09/16/2009 Source = Decent Whitepaper on Crowdsourcing includes timelines of adoption as well as companies that are using it and how they are using it.

n  Hempel, Jessi “Crowdsourcing: Milking the Masses for Inspiration” www.businessweek.com 09/25/2006 Reliability = Good, Article talking about how to rein in the Crowdsourced Crowds.

n  Abrahamson, Shaun, “What do Crowds Get from Crowdsourcing” www.mutopo.com 04/12/2009 Reliability = Decent, Article about the motivation of Crowds in Crowdsourcing

n  Netflix “Frequently Asked Questions” www.netflixprize.com 10/01/2006 Reliability = Great, Official Website for Netflix Prize.

n  Copeland, Michael, “Box office boffo for brainiacs: The Netflix Prize” http://brainstormtech.blogs.fortune.cnn.com 09/21/2009 Reliability = Good, A brief news article about the winning Netflix Prize team and some statistics.


+ Bibliography (Continued)!

n  Charles, Dan, “Internet Users Join Search For Steve Fossett” www.npr.org 09.12.07 Reliability = Great, Article talking about how the internet search for Steve Fossett started and how it was sent out to the crowds

n  Barbalace, Kenneth, “Internet search for Steve Fossett eight weeks later” blog.environmentalchemistry.com 10/31/2007 Reliability = Decent, Blog Entry about the Internet Search for Steve Fossett and some future applications of the technology used.

n  National Academy of Public Administration, http://opengov.ideascale.com/ Sep 18th, 2009 Reliability = Good, The Website that was opened up for public to submit and vote on policy issues for President Obama

n  Hansell, Saul, "Ideas Online, Yes, but Some Not So Presidential" www.nytimes.com 06/22/2009 Reliability = Great, News Article Talking about Policy Issues Website and Results

n  Various Sources “Just Some Thoughts on the Contest” www.netflixprize.com 07/05/2009 Reliability = Good, Some feedback from the participants on why they thought the Netflix Prize was such a successful contest.

n  Waltner, Charles, “I-Prize Contest Proving a Winning Approach to Discovering Billion-Dollar Business Ideas” newsroom.cisco.com 07/14/2008, Reliability = Great, Information about what the I-prize is and a small amount of information on the winning team
