
Genoveva Vargas-Solar CR1, CNRS, LIG-LAFMIA Genoveva.Vargas@imag.fr http://vargas-solar.com, Montevideo, 21st November, 2014


Data collection some techniques

HADAS GROUP

1 Vijay Upadhyay

http://slides.com/myasoobkhalid/web-scraping#/32

2

Web scraping


+ Web scraper!

n  Any program that retrieves structured data from the web, and then transforms it to conform with a different structure

n  Wait, isn’t that just ETL? (extract, transform, load)

n  Well, sort of, but I don’t want to call it that...

3

+ Web scraper!

n  “Scraping” applies to web pages, as opposed to pulling data from a CSV or JSON file

n  Why not ETL?
    n  ETL implies that there are rules and expectations
    n  These two things don’t exist in the world of the Web
    n  Sites can change the structure of their dataset without telling you, or even take the dataset down on a whim

n  A program that pulls down data is often going to be a bit hacky by necessity, so “scraper” seems like a good term for that

4

+ Web scraping!

n  Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites

n  Usually, such software programs simulate human exploration of the World Wide Web by either
    n  Implementing low-level Hypertext Transfer Protocol (HTTP), or
    n  Embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox

Wikipedia

n  A method to extract data from a website that does not have an API, or when we want to extract more data than the API's rate limits allow

n  Through web scraping we can extract any data which we can see while browsing the web.

5

+ What for?!

n  Extract product information

n  Extract job postings and internships

n  Extract offers and discounts from deal-of-the-day websites

n  Crawl forums and social websites

n  Extract data to make a search engine

n  Gathering weather data

6

+ Web scraping vs. API!

n  Web Scraping is not rate limited

n  Anonymously access the website and gather data

n  Some websites do not have an API

n  Some data is not accessible through an API

7

+ Web scraping workflow!

n  Get the website - using HTTP library

n  Parse the html document - using any parsing library

n  Store the results - either a db, csv, text file, etc
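A minimal sketch of this workflow, assuming the requests and bs4 (BeautifulSoup) libraries are installed; the URL and the <h2> selector are placeholders, not from the slides:

# Minimal scraping workflow sketch: fetch, parse, store.
import csv
import requests
from bs4 import BeautifulSoup

# 1. Get the website - using an HTTP library
response = requests.get('http://example.com/products')
html_doc = response.text

# 2. Parse the HTML document - using a parsing library
soup = BeautifulSoup(html_doc, 'html.parser')
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

# 3. Store the results - here a CSV file
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    for title in titles:
        writer.writerow([title])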

8

+ Libraries for parsing!

n  Some of the most widely known libraries used for web scraping are:
    n  BeautifulSoup
    n  lxml
    n  re
    n  Scrapy (a complete framework)

9

+ Parsing libraries!

n  BeautifulSoup
    tree = BeautifulSoup(html_doc)
    tree.title

n  lxml
    tree = lxml.html.fromstring(html_doc)
    title = tree.xpath('//title/text()')

n  re
    title = re.findall('<title>(.*?)</title>', html_doc)

10

+ BeautifulSoup!

n  A beautiful API
    soup = BeautifulSoup(html_doc)
    last_a_tag = soup.find("a", id="link3")
    all_b_tags = soup.find_all("b")

n  very easy to use

n  can handle broken markup

n  purely in Python

n  slow :(

11

+ lxml!

The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed

n  very fast
n  not purely in Python
n  If you have no "pure Python" requirement, use lxml
n  lxml works with all Python versions from 2.4 to 3.3

12

+ re!

n  re is the regex library for Python. It is used only to extract small amounts of text

n  Entire HTML parsing is not possible with regular expressions

n  However, it is
    n  baked into Python, as part of the standard library
    n  very fast - see the timing sketch below
    n  supported by every Python version
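A rough way to see the speed difference for yourself, assuming bs4 is installed; the sample document and repeat count are made up for illustration:

# Rough timing sketch: extract <title> with re vs. BeautifulSoup.
import re
import timeit
from bs4 import BeautifulSoup

html_doc = '<html><head><title>Hello</title></head><body><p>hi</p></body></html>'

def with_re():
    return re.findall('<title>(.*?)</title>', html_doc)

def with_bs4():
    return BeautifulSoup(html_doc, 'html.parser').title.string

print('re:  ', timeit.timeit(with_re, number=10000))
print('bs4: ', timeit.timeit(with_bs4, number=10000))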

13

+ Steps to writing a scraper!

n  Find the data source

n  Find the metadata

n  Analysis (verify the primary key): should also include noting which fields should be lookup fields

n  Develop

n  Test: is always done on real data and has three phases:
    n  dry run (nothing added or updated)
    n  dry run with lookups (only lookups are added)
    n  production run: run all three phases on a local instance before deploying to production

n  Fix (repeat ∞ times)

14

+ Storing scraped data!

n  Do not create tables before you understand how you want to use the data

n  Consider using a non-relational DB

n  See Adrian Holovaty’s talk on how EveryBlock avoided creating new tables for each dataset
    n  http://bit.ly/Yl6VAZ (relevant part starts at 7:10)

15

Components of a scraping system!

n  Downloader

n  Cacher
    n  Caching is essential when scraping a dataset that involves a large number of HTML pages
    n  Test runs can take hours if you’re making requests over the network
    n  A good caching system pretty-prints the files it downloads so you can more easily inspect them

n  Raw item retriever

n  Existing item detector

n  Item transformer

n  Status reporter
    n  Reporting is essential if you’re managing a group of scrapers
    n  Since you KNOW that at least one of your scrapers will be broken at any time, you might as well know which ones are broken
    n  A good reporting mechanism shows when your scrapers break, as well as when the dataset itself has issues (determined heuristically)

16

+ Scraping at scale!

n  You want to scrape millions of web pages every day

n  You want to make a broad scale web scraper

n  You want to use something that is thoroughly tested

n  Is there any solution?

17

+ Scrapy (http://scrapy.org)!

n  An application framework for writing web spiders that crawl web sites and extract data from them
    n  Scrapy only supports Python 2.7 and NOT 3.x
    n  It's a tested framework
    n  It's asynchronous
    n  It's easy to use
    n  It has everything you need to start scraping
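A minimal Scrapy spider sketch (the slide targets Scrapy on Python 2.7; current Scrapy releases also run on Python 3). The start URL and the CSS selectors are placeholders:

# quotes_spider.py - minimal Scrapy spider sketch.
# Run with: scrapy runspider quotes_spider.py -o items.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow the "next page" link, if any, and repeat.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, self.parse)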

18

Types of scrapers according to sources!

Some tools

19

Main types of scrapers!

n  CSV

n  RSS/Atom

n  JSON

n  XML

n  HTML crawler

n  Web browser

n  PDF

n  Database dump

n  GIS

n  Mixed

20

+ CSV!

n  import csv

n  You should usually use csv.DictReader  

n  If the column names are all caps, consider making them lowercase.

n  Watch out for CSV datasets that do not have the same number of elements on each row

21

+ CSV! 22

def get_rows(csv_file):
    reader = csv.reader(open(csv_file))
    # Get the column names, lowercased.
    column_names = tuple(k.lower() for k in next(reader))
    for row in reader:
        yield dict(zip(column_names, row))
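The same idea with csv.DictReader, as recommended above; a sketch assuming the file has a header row:

# Sketch: csv.DictReader handles the header-to-field mapping for you.
import csv

def get_rows_dict(csv_file):
    with open(csv_file) as f:
        for row in csv.DictReader(f):
            # Lowercase the column names, as suggested for all-caps headers.
            yield {k.lower(): v for k, v in row.items()}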

+ XML!

n  import lxml.etree  

n  Get rid of namespaces in the input document. http://bit.ly/LO5x7H  

n  A lot of XML datasets have a fairly flat structure

n  In these cases, convert the elements to dictionaries

23

+ XML! 24

<root>
  <items>
    <item>
      <id>3930277-ac</id>
      <name>Frodo Samwise</name>
      <age>56</age>
      <occupation>Tolkien scholar</occupation>
      <description>Short, with hairy feet</description>
    </item>
    ...
  </items>
</root>

import lxml.etree

tree = lxml.etree.fromstring(SOME_XML_STRING)
for el in tree.findall('items/item'):
    children = el.getchildren()
    # Keys are element names.
    keys = (c.tag for c in children)
    # Values are element text contents.
    values = (c.text for c in children)
    yield dict(zip(keys, values))
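One way to do the "get rid of namespaces" step mentioned above, sketched with lxml; the helper name is invented for illustration:

# Sketch: strip namespace prefixes so findall('items/item') works
# without namespace maps.
import lxml.etree

def strip_namespaces(tree):
    for el in tree.iter():
        if isinstance(el.tag, str):   # skip comments / processing instructions
            el.tag = lxml.etree.QName(el).localname
    return tree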

+ HTML!

n  import requests

n  import  lxml.html  

n  Use XPath, but pyquery seems fine too

n  If the HTML is very funky, use html5lib as the parser

n  Sometimes data can be scraped from a chunk of JavaScript embedded in the page
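A small sketch putting these together; the URL, the XPath expressions, and the JavaScript variable name are placeholders, with the regex line showing the "data in embedded JavaScript" fallback mentioned above:

# Sketch: fetch a page with requests, query it with lxml + XPath.
import json
import re
import requests
import lxml.html

html = requests.get('http://example.com/listing').text
doc = lxml.html.fromstring(html)

# XPath over the parsed document
titles = doc.xpath('//h2[@class="title"]/text()')
links = doc.xpath('//a/@href')

# Sometimes the data sits in a chunk of embedded JavaScript instead
match = re.search(r'var\s+pageData\s*=\s*(\{.*?\});', html, re.S)
if match:
    page_data = json.loads(match.group(1))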

25

+ HTML!

n  Firefinder (http://bit.ly/kr0UOY) Extension for Firebug

n  Allows you to test CSS and XPath expressions on any page, and visually inspect the results.

26

+ HTTP libraries!

n  Requests
    r = requests.get('https://www.google.com')
    html = r.text

n  urllib and urllib2
    html = urllib2.urlopen('http://python.org/').read()

n  httplib and httplib2
    h = httplib2.Http(".cache")
    (resp_headers, content) = h.request("http://example.org/", "GET")

27

+ PDF!

n  There are no Python libraries that handle all kinds of PDF documents in the wild

n  Use the pdftohtml command to convert the PDF to XML

n  When debugging, use pdftohtml to generate HTML that you can inspect in the browser

n  If the text in the PDF is in tabular format, you can group text cells by proximity
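A sketch of driving pdftohtml from Python and parsing its XML output; it assumes the pdftohtml command (from poppler-utils) is on the PATH, and the file name is a placeholder:

# Sketch: convert a PDF to XML with the pdftohtml command, then parse it.
import subprocess
import lxml.etree

xml_bytes = subprocess.check_output(
    ['pdftohtml', '-xml', '-stdout', 'report.pdf'])
tree = lxml.etree.fromstring(xml_bytes)

# Each <text> element carries its position, which is what the
# "group by proximity" strategy on the next slide relies on.
for cell in tree.findall('.//text'):
    print(cell.get('top'), cell.get('left'), cell.text)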

28

+ PDF!

The “group by proximity” strategy works like this:
n  1. Find a text cell that has a very distinct pattern (probably a date cell). This is your “anchor”
n  2. Find all cells that have the same row position as the anchor (possibly off by a few pixels)
n  3. Figure out which grouped cells belong to which fields based on column position
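A sketch of that strategy, assuming cells have already been extracted as (top, left, text) tuples as in the pdftohtml example above; the date pattern and pixel tolerance are assumptions:

# Sketch of "group by proximity": anchor on date-looking cells, then
# collect every cell that sits on (roughly) the same row.
import re

DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}')   # assumed anchor pattern
TOLERANCE = 3                                # pixels; an assumption

def group_rows(cells):
    """cells: list of (top, left, text) tuples."""
    anchors = [c for c in cells if c[2] and DATE_RE.search(c[2])]
    for top, left, text in anchors:
        # 2. cells with the same row position as the anchor (give or take)
        row = [c for c in cells if abs(c[0] - top) <= TOLERANCE]
        # 3. assign fields based on column (left) position
        row.sort(key=lambda c: c[1])
        yield [c[2] for c in row]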

29

+ RSS/Atom!

n  import feedparser  

n  Sometimes feedparser can’t handle custom fields, and you’ll have to fall back to lxml.etree

n  Unfortunately, plenty of RSS feeds are not compliant XML
    n  Either do some custom munging or try html5lib
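A minimal feedparser sketch; the feed URL is a placeholder:

# Sketch: parse an RSS/Atom feed with feedparser.
import feedparser

feed = feedparser.parse('http://example.com/feed.rss')
print(feed.feed.title)
for entry in feed.entries:
    print(entry.title, entry.link, entry.get('published', ''))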

30

+ youtube-dl (http://rg3.github.io/youtube-dl/)!

n  A Python script that allows you to download videos and music from various websites such as:
    n  Facebook
    n  YouTube
    n  Vimeo
    n  Dailymotion
    n  Metacafe
    n  and almost 300 more!

31

+ Design patterns!

n  If a field contains a finite number of possible values, use a lookup table instead of storing each value

n  Make a scraper superclass that incorporates common scraper logic

n  The scraper superclass will probably have convenience methods for converting dates/times, cleaning HTML, looking for existing items, etc. It should also incorporate the caching and reporting logic
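A sketch of what such a superclass might look like; the class and method names are invented for illustration, not from the slides:

# Sketch of a scraper superclass; names are invented for illustration.
import logging
from datetime import datetime

class BaseScraper(object):
    """Common logic: caching, reporting, conversions, existence checks."""

    def __init__(self, cache, store):
        self.cache = cache          # e.g. an on-disk HTTP cache
        self.store = store          # e.g. a database wrapper
        self.log = logging.getLogger(self.__class__.__name__)

    def get(self, url):
        """Fetch through the cache so test runs don't hit the network."""
        return self.cache.fetch(url)

    def parse_date(self, text, fmt='%Y-%m-%d'):
        return datetime.strptime(text.strip(), fmt).date()

    def exists(self, key):
        """Look for an existing item before inserting a new one."""
        return self.store.find(key) is not None

    def report(self, message):
        self.log.info(message)

    def scrape(self):
        raise NotImplementedError('subclasses implement the actual scraping')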

32

33

Web crawling

+ Motivation

A key motivation for designing Web crawlers has been to retrieve Web pages and add their representations to a local repository

+ Web Crawling

n  A Web crawler (also known as a Web spider, Web robot, or, in the FOAF community, a Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner

n  Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.

+ Crawlers

n  Computer programs that roam the Web with the goal of automating specific tasks related to the Web

n  The role of Crawlers is to collect Web Content

+ Basic crawler operation

n  Begin with known “seed” pages

n  Fetch and parse them

n  Extract URLs they point to

n  Place the extracted URLs on a Queue

n  Fetch each URL on the queue and repeat

+

Traditional Web Crawler (diagram): seed URLs initialize the Frontier; the crawler picks a URL from the Frontier, downloads the resource from the Web into the Repo, records it in Visited URLs, extracts its URLs, and feeds the new URLs back into the Frontier

+ Web crawler: basic algorithm

{
    Pick up the next URL
    Connect to the server
    GET the URL
    When the page arrives, get its links
    (optionally do other stuff)
    REPEAT
}
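The same loop as a small Python sketch, using requests and lxml; the seed URL and the page limit are assumptions:

# Sketch of the basic crawler loop; seed URL and page limit are assumptions.
from collections import deque
from urllib.parse import urljoin
import requests
import lxml.html

frontier = deque(['http://example.com/'])    # seed URLs
visited = set()

while frontier and len(visited) < 100:
    url = frontier.popleft()                 # pick up the next URL
    if url in visited:
        continue
    visited.add(url)
    try:
        html = requests.get(url, timeout=10).text   # connect and GET
    except requests.RequestException:
        continue
    doc = lxml.html.fromstring(html)
    for href in doc.xpath('//a/@href'):      # when the page arrives, get its links
        frontier.append(urljoin(url, href))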

+ Uses

n  Complete web search engine: Search Engine = Crawler + Indexer/Searcher (e.g., Lucene)

n  GUI
    n  Find stuff
    n  Gather stuff
    n  Check stuff

+ Types of Crawlers

•  Batch: Crawl a snapshot of their crawl space, until reaching a certain size or time limit

•  Incremental: Continuously crawl their crawl space, revisiting URLs to ensure freshness

•  Focused: Attempt to crawl pages pertaining to some topic/theme, while minimizing the number of off-topic pages that are collected

+ URL normalization!

n  Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once

n  The term URL normalization refers to the process of modifying and standardizing a URL in a consistent manner
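A sketch of simple normalization rules with urllib.parse; which rules are safe depends on the site, so the particular choices here (lowercase scheme and host, drop default ports, strip trailing slash and fragment) are illustrative:

# Sketch: normalize URLs so the same resource is not crawled twice.
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ''
    if parts.port and parts.port not in (80, 443):
        host = '%s:%d' % (host, parts.port)
    path = parts.path or '/'
    if path != '/' and path.endswith('/'):
        path = path.rstrip('/')
    # Drop the fragment; keep the query string as-is.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ''))

# normalize_url('HTTP://Example.COM:80/a/b/#section') -> 'http://example.com/a/b'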

+ The challenges of “Web Crawling”

Three characteristics of the Web that make crawling it very difficult:

n  Its large volume

n  Its fast rate of change

n  Dynamic page generation

+ Examples of Web crawlers!

•  RBSE

•  World Wide Web Worm

•  Google Crawler

•  WebFountain

•  WebRACE

+ Web 3.0 Crawling

Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized in
- Semantic Web
- Website Parse Template concepts

Web 3.0 crawling and indexing technologies will be based on
- Human-machine clever associations


How are Web APIs used?

n  A series or collection of web services

n  Sometimes used interchangeably with “web services”

n  Examples: Google APIs, Amazon.com APIs

+ How Do You Call a Web API?

XML web services can be invoked in one of three ways:

n  Using REST (HTTP-GET)
    n  URL includes parameters
    n  Example: “http://search.twitter.com/search.atom?q=”

n  Using HTTP-POST
    n  You post an XML document
    n  An XML document is returned

n  Using SOAP
    n  More complex; allows structured and typed information
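A sketch of the first two styles with the requests library; the endpoint URLs and payloads are placeholders (the Twitter search URL above is long dead):

# Sketch: calling a web API via REST (HTTP GET) and via HTTP POST.
import requests

# REST / HTTP-GET: parameters go in the URL
resp = requests.get('http://api.example.com/search',
                    params={'q': 'recession', 'format': 'json'})
results = resp.json()

# HTTP-POST: send a document in the request body, get one back
payload = '<query><term>recession</term></query>'
resp = requests.post('http://api.example.com/query',
                     data=payload,
                     headers={'Content-Type': 'application/xml'})
print(resp.status_code, resp.text[:200])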

+ APIs that deliver information

(Diagram) An App sends keywords (e.g., Recession, slump) and structured queries (e.g., Recession, 22Nov’08, NY) to a Web Crawling and Indexing Web API, and receives XML documents in return

+ References  

•  http://en.wikipedia.org/wiki/Web_crawling
•  www.cs.cmu.edu/~spandey
•  www.cs.odu.edu/~fmccown/research/lazy/crawling-policies-ht06.ppt
•  http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
•  www.grub.org
•  www.filesland.com/companies/Shettysoft-com/web-crawler.html
•  www.ciw.cl/recursos/webCrawling.pdf
•  www.openldap.org/conf/odd-wien-2003/peter.pdf

50

Crowdsourcing

Fansourcing Crowdcasting Open Sourcing

Open Innovation Mass Collaboration Collective Customer Commitment

Wikinomics Collective Intelligence

Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call

"Crowdsourcing" - The term was coined by Jeff Howe in Wired Magazine in 2006 3

+ Wisdom of the Crowds!

n  The crowd at a county fair accurately guessed the weight of an ox when their individual guesses were averaged

n  The average was
    n  Closer to the ox's true butchered weight than the estimates of most crowd members, and also
    n  Closer than any of the separate estimates made by cattle experts

+ Wikinomics 101: Wisdom of the Crowds

(Chart) Enterprise Web 2.0: utility vs. number of contributors - a few experts ($$$$) on the left, the masses ($) at 10, 100, 1,000, 10,000+ contributors on the right; the area under the curve for the crowd represents equivalent, or greater, utility

+ Economics & Wikinomics

(Chart) The same Enterprise Web 2.0 utility-vs-contributors curve, contrasting:

n  4,000 experts, 80,000 articles, 200 years to develop, annual updates, “8.8/10.0 Reliability”

n  100,000 amateurs, 1.6 million articles, 5 years to develop, real-time updates, “8.0/10.0 Reliability”

+ What is crowdsourcing?!

n  Crowdsourcing is an online, distributed problem-solving and production model
n  Users--also known as the crowd--typically form into online communities based on the Web site, and the crowd submits solutions to the site or produces its contents
n  The crowd can also sort through the solutions, finding the best ones
n  These best solutions are then owned by the entity that broadcast the problem in the first place--the crowdsourcer

n  The winning individuals in the crowd are sometimes rewarded

n  Many individuals in the crowd participate just for intellectual stimulation or because of emotional ties to product or service

55

+ Benefits of Crowdsourcing to Companies!

n  Problems can be explored at comparatively little cost

n  Payment is by results

n  The organization can tap a wider range of talent than might be present in its own organization

n  Turn customers into designers

n  Turn customers into marketers

Amazon Mechanical Turk

Crowdsourcing: Rent-A-Coder

Crowdsourcing: the benefits!

n  Companies get
    n  Improved quality and productivity
    n  Feedback
    n  Good exposure
    n  Minimum of cost

n  People get
    n  Incentive
        n  Cash Cash Cash
        n  Recognition
        n  Sense of accomplishment among peers
    n  Make life better (e.g., Linux, the Obama Campaign)

+ Problems with Crowdsourcing!

n  Quality

n  Intellectual property leakage

n  No time constraint

n  Not much control over development or ultimate product

n  Ill-will with own employees

n  Choosing what to crowdsource & what to keep in-house

+ Type of problems to outsource!

n  No internal expertise

n  Non-essential and non-critical

n  One that has no time constraint

n  One that benefits from crowd involvement

n  One-time problems

Some Applications of Crowdsourcing!

•  Testing & refining a product
    •  Netflix
    •  SellaBand

•  Market research
    •  Threadless

•  Knowledge management
    •  Accenture
    •  Wikipedia

•  Customer service
    •  My Starbucks Ideas

•  R & D
    •  InnoCentive
    •  P&G Connect & Develop

•  Polling and voting
    •  InTrade
    •  Building a new city

+ Elements for a Wise Crowd!!

•  Diversity of opinion: Each person should have private information even if it's just an eccentric interpretation of the known facts

•  Independence: People's opinions aren't determined by the opinions of those around them

•  Decentralization: People are able to specialize and draw on local knowledge

•  Aggregation: Some mechanism exists for turning private judgments into a collective decision

+ Reasons to fear Crowd Intelligence!

•  Too homogeneous: The need for diversity within a crowd to ensure enough variance in approach, thought process, and private information

•  Too centralized: The Columbia Shuttle Disaster, hierarchical NASA management bureaucracy decision making was totally closed to the wisdom of low-level engineers

•  Too divided: The US Intelligence community failed to prevent the September 11 attacks partly because information held by one subdivision was not accessible by another. Crowds work best when they choose for themselves what to work on and what information they need

•  Too imitative: Where choices are visible and made in sequence, an information cascade can form in which only the first few decision makers gain anything by contemplating the choices available

•  Too emotional: Emotional factors, such as a feeling of belonging, can lead to peer pressure and herd mentality

+ Conclusion:!

n  Crowdsourcing used properly
    n  Generates new ideas
    n  Cuts development costs
    n  Creates a direct, emotional bond with customers

n  Used improperly
    n  Can produce useless, wasteful results
    n  Beware of mob rule

“Crowds can be wise, but they can also be stupid. “

+

https://crowd4u.org!

69

+ Want More Information?!

n  About Crowdsourcing
    n  Jeff Howe’s blog: www.crowdsourcing.com
    n  “The Rise of Crowdsourcing”: www.wired.com/wired/archive/14.06/crowds.html
    n  “Paid Crowdsourcing: Current State and Progress towards Mainstream Business Use”: http://www.marketwire.com/press-release/SmartsheetCom-1045951.html

+Bibliography!

n  Alsever, Jennifer, “What is Crowdsourcing?” www.bnet.com Mar 7th, 2007 Reliability = Good, Article summarized a lot of need to know information about Crowdsourcing as it was just becoming a topic for business.

n  Howe, Jeff, Crowdsourcing Definition, http://www.crowdsourcing.com, checked Apr 18th 2009. Reliability = Blog site of Jeff Howe, who coined the term Crowdsourcing. Site contains links and thoughts on articles in the news and feedback from speaking events.

n  Howe, Jeff, “The Rise of Crowdsourcing”, www.wired.com, 06-Sep. Reliability = Great. The original article where the term “Crowdsourcing” was born; talks about a few companies that are using it.

n  Frei, Brent “Paid Crowdsourcing: Current State & Progress toward Mainstream Business Use” www.marketwire.com 09/16/2009 Source = Decent Whitepaper on Crowdsourcing includes timelines of adoption as well as companies that are using it and how they are using it.

n  Hempel, Jessi “Crowdsourcing: Milking the Masses for Inspiration” www.businessweek.com 09/25/2006 Reliability = Good, Article talking about how to reign in the Crowdsourced Crowds.

n  Abrahamson, Shaun, “What do Crowds Get from Crowdsourcing” www.mutopo.com 04/12/2009 Reliability = Decent, Article about the motivation of Crowds in Crowdsourcing

n  Netflix “Frequently Asked Questions” www.netflixprize.com 10/01/2006 Reliability = Great, Official Website for Netflix Prize.

n  Copeland, Michael, “Box office boffo for brainiacs: The Netflix Prize” http://brainstormtech.blogs.fortune.cnn.com 09/21/2009 Reliability = Good, A brief news article about the winning Netflix Prize team and some statistics.

+ Bibliography (Continued)!

n  Charles, Dan, “Internet Users Join Search For Steve Fossett” www.npr.org 09.12.07 Reliability = Great, Article talking about how the internet search for Steve Fossett started and how it was sent out to the crowds

n  Barbalace, Kenneth, “Internet search for Steve Fossett eight weeks later” blog.environmentalchemistry.com 10/31/2007 Reliability = Decent, Blog Entry about the Internet Search for Steve Fossett and some future applications of the technology used.

n  National Academy of Public Administration, http://opengov.ideascale.com/ Sep 18th, 2009 Reliability = Good, The Website that was opened up for public to submit and vote on policy issues for President Obama

n  Hansell, Saul, "Ideas Online, Yes, but Some Not So Presidential" www.nytimes.com 06/22/2009 Reliability = Great, News Article Talking about Policy Issues Website and Results

n  Various Sources “Just Some Thoughts on the Contest” www.netflixprize.com 07/05/2009 Reliability = Good, Some feedback from the participants on why they thought the Netflix Prize was such a successful contest.

n  Waltner, Charles, “I-Prize Contest Proving a Winning Approach to Discovering Billion-Dollar Business Ideas” newsroom.cisco.com 07/14/2008, Reliability = Great, Information about what the I-prize is and a small amount of information on the winning team