
Genoveva Vargas-Solar CR1, CNRS, LIG-LAFMIA Genoveva.Vargas@imag.fr http://vargas-solar.com, Montevideo, 21st November, 2014


Data collection some techniques

HADAS GROUP

1 Vijay Upadhyay

http://slides.com/myasoobkhalid/web-scraping#/32

2

Web scraping


+ Web scraper!

n  Any program that retrieves structured data from the web, and then transforms it to conform with a different structure

n  Wait, isn’t that just ETL? (extract, transform, load)

n  Well, sort of, but I don’t want to call it that...

3

+ Web scraper!

n  “Scraping” applies to web pages, as opposed to pulling data from a CSV or JSON file

n  Why not ETL?
    n  ETL implies that there are rules and expectations
    n  These two things don’t exist in the world of the Web
    n  Sites can change the structure of their dataset without telling you, or even take the dataset down on a whim

n  A program that pulls down data is often going to be a bit hacky by necessity, so “scraper” seems like a good term for that

4

+ Web scraping!

n  Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites

n  Usually, such software programs simulate human exploration of the World Wide Web by either
    n  Implementing low-level Hypertext Transfer Protocol (HTTP), or
    n  Embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox

Wikipedia

n  A method to extract data from a website that does not have an API, or when we want to extract more data than the API's rate limits allow

n  Through web scraping we can extract any data which we can see while browsing the web.

5

+ What for?!

n  Extract product information

n  Extract job postings and internships

n  Extract offers and discounts from deal-of-the-day websites

n  Crawl forums and social websites

n  Extract data to make a search engine

n  Gathering weather data

6

+ Web scraping vs. API!

n  Web Scraping is not rate limited

n  Anonymously access the website and gather data

n  Some websites do not have an API

n  Some data is not accessible through an API

7

+ Web scraping workflow!

n  Get the website - using HTTP library

n  Parse the html document - using any parsing library

n  Store the results - either a db, csv, text file, etc
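A minimal sketch of this workflow, assuming the requests and bs4 (BeautifulSoup) libraries are installed; the URL and the <h2> selector are placeholders, not from the slides:

# Minimal scraping workflow sketch: fetch, parse, store.
import csv
import requests
from bs4 import BeautifulSoup

# 1. Get the website - using an HTTP library
response = requests.get('http://example.com/products')
html_doc = response.text

# 2. Parse the HTML document - using a parsing library
soup = BeautifulSoup(html_doc, 'html.parser')
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

# 3. Store the results - here a CSV file
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    for title in titles:
        writer.writerow([title])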

8

+ Libraries for parsing!

n  Some of the most widely known libraries used for web scraping are:
    n  BeautifulSoup
    n  lxml
    n  re
    n  Scrapy (a complete framework)

9

+ Parsing libraries!

n  BeautifulSoup
    tree = BeautifulSoup(html_doc)
    tree.title

n  lxml
    tree = lxml.html.fromstring(html_doc)
    title = tree.xpath('//title/text()')

n  re
    title = re.findall('<title>(.*?)</title>', html_doc)

10

+ BeautifulSoup!

n  A beautiful API
    soup = BeautifulSoup(html_doc)
    last_a_tag = soup.find("a", id="link3")
    all_b_tags = soup.find_all("b")

n  very easy to use

n  can handle broken markup

n  purely in Python

n  slow :(

11

+ lxml!

The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed

n  very fast
n  not purely in Python
n  If you have no "pure Python" requirement, use lxml
n  lxml works with all Python versions from 2.4 to 3.3

12

+ re!

n  re is the regex library for Python. It is used only to extract small amounts of text

n  Entire HTML parsing is not possible with regular expressions

n  However, it is
    n  baked into Python, as part of the standard library
    n  very fast - see the timing sketch below
    n  supported by every Python version
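A rough way to see the speed difference for yourself, assuming bs4 is installed; the sample document and repeat count are made up for illustration:

# Rough timing sketch: extract <title> with re vs. BeautifulSoup.
import re
import timeit
from bs4 import BeautifulSoup

html_doc = '<html><head><title>Hello</title></head><body><p>hi</p></body></html>'

def with_re():
    return re.findall('<title>(.*?)</title>', html_doc)

def with_bs4():
    return BeautifulSoup(html_doc, 'html.parser').title.string

print('re:  ', timeit.timeit(with_re, number=10000))
print('bs4: ', timeit.timeit(with_bs4, number=10000))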

13

+ Steps to writing a scraper!

n  Find the data source

n  Find the metadata

n  Analysis (verify the primary key): should also include noting which fields should be lookup fields

n  Develop

n  Test: is always done on real data and has three phases:
    n  dry run (nothing added or updated)
    n  dry run with lookups (only lookups are added)
    n  production run: run all three phases on a local instance before deploying to production

n  Fix (repeat ∞ times)

14

+ Storing scraped data!

n  Do not create tables before you understand how you want to use the data

n  Consider using a non-relational DB

n  See Adrian Holovaty’s talk on how EveryBlock avoided creating new tables for each dataset
    n  http://bit.ly/Yl6VAZ (relevant part starts at 7:10)

15

Components of a scraping system!

n  Downloader

n  Cacher
    n  Caching is essential when scraping a dataset that involves a large number of HTML pages
    n  Test runs can take hours if you’re making requests over the network
    n  A good caching system pretty-prints the files it downloads so you can more easily inspect them

n  Raw item retriever

n  Existing item detector

n  Item transformer

n  Status reporter
    n  Reporting is essential if you’re managing a group of scrapers
    n  Since you KNOW that at least one of your scrapers will be broken at any time, you might as well know which ones are broken
    n  A good reporting mechanism shows when your scrapers break, as well as when the dataset itself has issues (determined heuristically)

16

+ Scraping at scale!

n  You want to scrape millions of web pages every day

n  You want to make a broad scale web scraper

n  You want to use something that is thoroughly tested

n  Is there any solution?

17

+ Scrapy (http://scrapy.org)!

n  An application framework for writing web spiders that crawl web sites and extract data from them
    n  Scrapy only supports Python 2.7 and NOT 3.x
    n  It's a tested framework
    n  It's asynchronous
    n  It's easy to use
    n  It has everything you need to start scraping
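A minimal Scrapy spider sketch (the slide targets Scrapy on Python 2.7; current Scrapy releases also run on Python 3). The start URL and the CSS selectors are placeholders:

# quotes_spider.py - minimal Scrapy spider sketch.
# Run with: scrapy runspider quotes_spider.py -o items.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow the "next page" link, if any, and repeat.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, self.parse)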

18

Types of scrapers according to sources!

Some tools

19

Main types of scrapers!

n  CSV

n  RSS/Atom

n  JSON

n  XML

n  HTML crawler

n  Web browser

n  PDF

n  Database dump

n  GIS

n  Mixed

20

+ CSV!

n  import csv

n  You should usually use csv.DictReader  

n  If the column names are all caps, consider making them lowercase.

n  Watch out for CSV datasets that do not have the same number of elements on each row

21

+ CSV! 22

def get_rows(csv_file):
    reader = csv.reader(open(csv_file))
    # Get the column names, lowercased.
    column_names = tuple(k.lower() for k in next(reader))
    for row in reader:
        yield dict(zip(column_names, row))
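The same idea with csv.DictReader, as recommended above; a sketch assuming the file has a header row:

# Sketch: csv.DictReader handles the header-to-field mapping for you.
import csv

def get_rows_dict(csv_file):
    with open(csv_file) as f:
        for row in csv.DictReader(f):
            # Lowercase the column names, as suggested for all-caps headers.
            yield {k.lower(): v for k, v in row.items()}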

+ XML!

n  import lxml.etree  

n  Get rid of namespaces in the input document. http://bit.ly/LO5x7H  

n  A lot of XML datasets have a fairly flat structure

n  In these cases, convert the elements to dictionaries

23

+ XML! 24

<root>
  <items>
    <item>
      <id>3930277-ac</id>
      <name>Frodo Samwise</name>
      <age>56</age>
      <occupation>Tolkien scholar</occupation>
      <description>Short, with hairy feet</description>
    </item>
    ...
  </items>
</root>

import lxml.etree

tree = lxml.etree.fromstring(SOME_XML_STRING)
for el in tree.findall('items/item'):
    children = el.getchildren()
    # Keys are element names.
    keys = (c.tag for c in children)
    # Values are element text contents.
    values = (c.text for c in children)
    yield dict(zip(keys, values))
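One way to do the "get rid of namespaces" step mentioned above, sketched with lxml; the helper name is invented for illustration:

# Sketch: strip namespace prefixes so findall('items/item') works
# without namespace maps.
import lxml.etree

def strip_namespaces(tree):
    for el in tree.iter():
        if isinstance(el.tag, str):   # skip comments / processing instructions
            el.tag = lxml.etree.QName(el).localname
    return tree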

+ HTML!

n  import requests

n  import  lxml.html  

n  Use XPath, but pyquery seems fine too

n  If the HTML is very funky, use html5lib as the parser

n  Sometimes data can be scraped from a chunk of JavaScript embedded in the page
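A small sketch putting these together; the URL, the XPath expressions, and the JavaScript variable name are placeholders, with the regex line showing the "data in embedded JavaScript" fallback mentioned above:

# Sketch: fetch a page with requests, query it with lxml + XPath.
import json
import re
import requests
import lxml.html

html = requests.get('http://example.com/listing').text
doc = lxml.html.fromstring(html)

# XPath over the parsed document
titles = doc.xpath('//h2[@class="title"]/text()')
links = doc.xpath('//a/@href')

# Sometimes the data sits in a chunk of embedded JavaScript instead
match = re.search(r'var\s+pageData\s*=\s*(\{.*?\});', html, re.S)
if match:
    page_data = json.loads(match.group(1))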

25

+ HTML!

n  Firefinder (http://bit.ly/kr0UOY) Extension for Firebug

n  Allows you to test CSS and XPath expressions on any page, and visually inspect the results.

26

+ HTTP libraries!

n  Requests
    r = requests.get('https://www.google.com')
    html = r.text

n  urllib and urllib2
    html = urllib2.urlopen('http://python.org/').read()

n  httplib and httplib2
    h = httplib2.Http(".cache")
    (resp_headers, content) = h.request("http://example.org/", "GET")

27

+ PDF!

n  There are no Python libraries that handle all kinds of PDF documents in the wild

n  Use the pdftohtml command to convert the PDF to XML

n  When debugging, use pdftohtml to generate HTML that you can inspect in the browser

n  If the text in the PDF is in tabular format, you can group text cells by proximity
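A sketch of driving pdftohtml from Python and parsing its XML output; it assumes the pdftohtml command (from poppler-utils) is on the PATH, and the file name is a placeholder:

# Sketch: convert a PDF to XML with the pdftohtml command, then parse it.
import subprocess
import lxml.etree

xml_bytes = subprocess.check_output(
    ['pdftohtml', '-xml', '-stdout', 'report.pdf'])
tree = lxml.etree.fromstring(xml_bytes)

# Each <text> element carries its position, which is what the
# "group by proximity" strategy on the next slide relies on.
for cell in tree.findall('.//text'):
    print(cell.get('top'), cell.get('left'), cell.text)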

28

+ PDF!

The “group by proximity” strategy works like this:
n  1. Find a text cell that has a very distinct pattern (probably a date cell). This is your “anchor”
n  2. Find all cells that have the same row position as the anchor (possibly off by a few pixels)
n  3. Figure out which grouped cells belong to which fields based on column position
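A sketch of that strategy, assuming cells have already been extracted as (top, left, text) tuples as in the pdftohtml example above; the date pattern and pixel tolerance are assumptions:

# Sketch of "group by proximity": anchor on date-looking cells, then
# collect every cell that sits on (roughly) the same row.
import re

DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}')   # assumed anchor pattern
TOLERANCE = 3                                # pixels; an assumption

def group_rows(cells):
    """cells: list of (top, left, text) tuples."""
    anchors = [c for c in cells if c[2] and DATE_RE.search(c[2])]
    for top, left, text in anchors:
        # 2. cells with the same row position as the anchor (give or take)
        row = [c for c in cells if abs(c[0] - top) <= TOLERANCE]
        # 3. assign fields based on column (left) position
        row.sort(key=lambda c: c[1])
        yield [c[2] for c in row]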

29

+ RSS/Atom!

n  import feedparser  

n  Sometimes feedparser can’t handle custom fields, and you’ll have to fall back to lxml.etree

n  Unfortunately, plenty of RSS feeds are not compliant XML
    n  Either do some custom munging or try html5lib
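A minimal feedparser sketch; the feed URL is a placeholder:

# Sketch: parse an RSS/Atom feed with feedparser.
import feedparser

feed = feedparser.parse('http://example.com/feed.rss')
print(feed.feed.title)
for entry in feed.entries:
    print(entry.title, entry.link, entry.get('published', ''))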

30

+ youtube-dl (http://rg3.github.io/youtube-dl/)!

n  A Python script that allows you to download videos and music from various websites such as:
    n  Facebook
    n  YouTube
    n  Vimeo
    n  Dailymotion
    n  Metacafe
    n  and almost 300 more!

31

+ Design patterns!

n  If a field contains a finite number of possible values, use a lookup table instead of storing each value

n  Make a scraper superclass that incorporates common scraper logic

n  The scraper superclass will probably have convenience methods for converting dates/times, cleaning HTML, looking for existing items, etc. It should also incorporate the caching and reporting logic
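A sketch of what such a superclass might look like; the class and method names are invented for illustration, not from the slides:

# Sketch of a scraper superclass; names are invented for illustration.
import logging
from datetime import datetime

class BaseScraper(object):
    """Common logic: caching, reporting, conversions, existence checks."""

    def __init__(self, cache, store):
        self.cache = cache          # e.g. an on-disk HTTP cache
        self.store = store          # e.g. a database wrapper
        self.log = logging.getLogger(self.__class__.__name__)

    def get(self, url):
        """Fetch through the cache so test runs don't hit the network."""
        return self.cache.fetch(url)

    def parse_date(self, text, fmt='%Y-%m-%d'):
        return datetime.strptime(text.strip(), fmt).date()

    def exists(self, key):
        """Look for an existing item before inserting a new one."""
        return self.store.find(key) is not None

    def report(self, message):
        self.log.info(message)

    def scrape(self):
        raise NotImplementedError('subclasses implement the actual scraping')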

32

33

Web crawling

+ Motivation

A key motivation for designing Web crawlers has been to retrieve Web pages and add their representations to a local repository

+ Web Crawling

n  A Web crawler (also known as a Web spider, Web robot, or, in the FOAF community, a Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner

n  Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.

+ Crawlers

n  Computer programs that roam the Web with the goal of automating specific tasks related to the Web

n  The role of Crawlers is to collect Web Content

+ Basic crawler operation

n  Begin with known “seed” pages

n  Fetch and parse them

n  Extract URLs they point to

n  Place the extracted URLs on a Queue

n  Fetch each URL on the queue and repeat

+

Traditional Web Crawler (diagram): seed URLs initialize the Frontier; the crawler picks a URL from the Frontier, downloads the resource from the Web into the Repo, records it in Visited URLs, extracts its URLs, and feeds the new URLs back into the Frontier

+ Web crawler: basic algorithm

{
    Pick up the next URL
    Connect to the server
    GET the URL
    When the page arrives, get its links
    (optionally do other stuff)
    REPEAT
}
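The same loop as a small Python sketch, using requests and lxml; the seed URL and the page limit are assumptions:

# Sketch of the basic crawler loop; seed URL and page limit are assumptions.
from collections import deque
from urllib.parse import urljoin
import requests
import lxml.html

frontier = deque(['http://example.com/'])    # seed URLs
visited = set()

while frontier and len(visited) < 100:
    url = frontier.popleft()                 # pick up the next URL
    if url in visited:
        continue
    visited.add(url)
    try:
        html = requests.get(url, timeout=10).text   # connect and GET
    except requests.RequestException:
        continue
    doc = lxml.html.fromstring(html)
    for href in doc.xpath('//a/@href'):      # when the page arrives, get its links
        frontier.append(urljoin(url, href))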

+ Uses

n  Complete web search engine: Search Engine = Crawler + Indexer/Searcher (e.g., Lucene)

n  GUI
    n  Find stuff
    n  Gather stuff
    n  Check stuff

+ Types of Crawlers

•  Batch: Crawl a snapshot of their crawl space, until reaching a certain size or time limit

•  Incremental: Continuously crawl their crawl space, revisiting URLs to ensure freshness

•  Focused: Attempt to crawl pages pertaining to some topic/theme, while minimizing the number of off-topic pages that are collected

+ URL normalization!

n  Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once

n  The term URL normalization refers to the process of modifying and standardizing a URL in a consistent manner
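A sketch of simple normalization rules with urllib.parse; which rules are safe depends on the site, so the particular choices here (lowercase scheme and host, drop default ports, strip trailing slash and fragment) are illustrative:

# Sketch: normalize URLs so the same resource is not crawled twice.
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ''
    if parts.port and parts.port not in (80, 443):
        host = '%s:%d' % (host, parts.port)
    path = parts.path or '/'
    if path != '/' and path.endswith('/'):
        path = path.rstrip('/')
    # Drop the fragment; keep the query string as-is.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ''))

# normalize_url('HTTP://Example.COM:80/a/b/#section') -> 'http://example.com/a/b'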

+ The challenges of “Web Crawling”

Three characteristics of the Web that make crawling it very difficult:

n  Its large volume

n  Its fast rate of change

n  Dynamic page generation

+ Examples of Web crawlers!

•  RBSE

•  World Wide Web Worm

•  Google Crawler

•  WebFountain

•  WebRACE

+ Web 3.0 Crawling

Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized in
- Semantic Web
- Website Parse Template concepts

Web 3.0 crawling and indexing technologies will be based on
- Human-machine clever associations


How are Web APIs used?

n  A series or collection of web services

n  Sometimes used interchangeably with “web services”

n  Examples: Google APIs, Amazon.com APIs

+ How Do You Call a Web API?

XML web services can be invoked in one of three ways:

n  Using REST (HTTP-GET)
    n  URL includes parameters
    n  Example: “http://search.twitter.com/search.atom?q=”

n  Using HTTP-POST
    n  You post an XML document
    n  An XML document is returned

n  Using SOAP
    n  More complex; allows structured and typed information
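A sketch of the first two styles with the requests library; the endpoint URLs and payloads are placeholders (the Twitter search URL above is long dead):

# Sketch: calling a web API via REST (HTTP GET) and via HTTP POST.
import requests

# REST / HTTP-GET: parameters go in the URL
resp = requests.get('http://api.example.com/search',
                    params={'q': 'recession', 'format': 'json'})
results = resp.json()

# HTTP-POST: send a document in the request body, get one back
payload = '<query><term>recession</term></query>'
resp = requests.post('http://api.example.com/query',
                     data=payload,
                     headers={'Content-Type': 'application/xml'})
print(resp.status_code, resp.text[:200])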

+ APIs that deliver information

(Diagram) An App sends keywords (e.g., Recession, slump) and structured queries (e.g., Recession, 22Nov’08, NY) to a Web Crawling and Indexing Web API, and receives XML documents in return

+ References  

•  http://en.wikipedia.org/wiki/Web_crawling
•  www.cs.cmu.edu/~spandey
•  www.cs.odu.edu/~fmccown/research/lazy/crawling-policies-ht06.ppt
•  http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
•  www.grub.org
•  www.filesland.com/companies/Shettysoft-com/web-crawler.html
•  www.ciw.cl/recursos/webCrawling.pdf
•  www.openldap.org/conf/odd-wien-2003/peter.pdf

50

Crowdsourcing

Fansourcing Crowdcasting Open Sourcing

Open Innovation Mass Collaboration Collective Customer Commitment

Wikinomics Collective Intelligence

Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call

"Crowdsourcing" - The term was coined by Jeff Howe in Wired Magazine in 2006 3

+ Wisdom of the Crowds!

n  The crowd at a county fair accurately guessed the weight of an ox when their individual guesses were averaged

n  The average was
    n  Closer to the ox's true butchered weight than the estimates of most crowd members, and also
    n  Closer than any of the separate estimates made by cattle experts

+ Wikinomics 101: Wisdom of the Crowds

(Chart) Enterprise Web 2.0: utility vs. number of contributors - a few experts ($$$$) on the left, the masses ($) at 10, 100, 1,000, 10,000+ contributors on the right; the area under the curve for the crowd represents equivalent, or greater, utility

+ Economics & Wikinomics

(Chart) The same Enterprise Web 2.0 utility-vs-contributors curve, contrasting:

n  4,000 experts, 80,000 articles, 200 years to develop, annual updates, “8.8/10.0 Reliability”

n  100,000 amateurs, 1.6 million articles, 5 years to develop, real-time updates, “8.0/10.0 Reliability”

+ What is crowdsourcing?!

n  Crowdsourcing is an online, distributed problem-solving and production model
n  Users--also known as the crowd--typically form into online communities based on the Web site, and the crowd submits solutions to the site or produces its contents
n  The crowd can also sort through the solutions, finding the best ones
n  These best solutions are then owned by the entity that broadcast the problem in the first place--the crowdsourcer

n  The winning individuals in the crowd are sometimes rewarded

n  Many individuals in the crowd participate just for intellectual stimulation or because of emotional ties to product or service

55

+ Benefits of Crowdsourcing to Companies!

n  Problems can be explored at comparatively little cost

n  Payment is by results

n  The organization can tap a wider range of talent than might be present in its own organization

n  Turn customers into designers

n  Turn customers into marketers

Amazon Mechanical Turk

Crowdsourcing: Rent-A-Coder

Crowdsourcing: the benefits!

n  Companies get
    n  Improved quality and productivity
    n  Feedback
    n  Good exposure
    n  Minimum of cost

n  People get
    n  Incentive
        n  Cash Cash Cash
        n  Recognition
        n  Sense of accomplishment among peers
    n  Make life better (e.g., Linux, the Obama Campaign)

+ Problems with Crowdsourcing!

n  Quality

n  Intellectual property leakage

n  No time constraint

n  Not much control over development or ultimate product

n  Ill-will with own employees

n  Choosing what to crowdsource & what to keep in-house

+ Type of problems to outsource!

n  No internal expertise

n  Non-essential and non-critical

n  One that has no time constraint

n  One that benefits from crowd involvement

n  One-time problems

Some Applications of Crowdsourcing!

•  Testing & refining a product
    •  Netflix
    •  SellaBand

•  Market research
    •  Threadless

•  Knowledge management
    •  Accenture
    •  Wikipedia

•  Customer service
    •  My Starbucks Ideas

•  R & D
    •  InnoCentive
    •  P&G Connect & Develop

•  Polling and voting
    •  InTrade
    •  Building a new city

+ Elements for a Wise Crowd!!

•  Diversity of opinion: Each person should have private information even if it's just an eccentric interpretation of the known facts

•  Independence: People's opinions aren't determined by the opinions of those around them

•  Decentralization: People are able to specialize and draw on local knowledge

•  Aggregation: Some mechanism exists for turning private judgments into a collective decision

+ Reasons to fear Crowd Intelligence!

•  Too homogeneous: The need for diversity within a crowd to ensure enough variance in approach, thought process, and private information

•  Too centralized: The Columbia Shuttle Disaster, hierarchical NASA management bureaucracy decision making was totally closed to the wisdom of low-level engineers

•  Too divided: The US Intelligence community failed to prevent the September 11 attacks partly because information held by one subdivision was not accessible by another. Crowds work best when they choose for themselves what to work on and what information they need

•  Too imitative: Where choices are visible and made in sequence, an information cascade can form in which only the first few decision makers gain anything by contemplating the choices available

•  Too emotional: Emotional factors, such as a feeling of belonging, can lead to peer pressure and herd mentality

+ Conclusion:!

n  Crowdsourcing used properly
    n  Generates new ideas
    n  Cuts development costs
    n  Creates a direct, emotional bond with customers

n  Used improperly
    n  Can produce useless, wasteful results
    n  Beware of mob rule

“Crowds can be wise, but they can also be stupid. “

+

https://crowd4u.org!

69

+ Want More Information?!

n  About Crowdsourcing
    n  Jeff Howe’s blog: www.crowdsourcing.com
    n  “The Rise of Crowdsourcing”: www.wired.com/wired/archive/14.06/crowds.html
    n  “Paid Crowdsourcing: Current State and Progress towards Mainstream Business Use”: http://www.marketwire.com/press-release/SmartsheetCom-1045951.html

+Bibliography!

n  Alsever, Jennifer, “What is Crowdsourcing?” www.bnet.com Mar 7th, 2007 Reliability = Good, Article summarized a lot of need to know information about Crowdsourcing as it was just becoming a topic for business.

n  Howe, Jeff, Crowdsourcing Definition, http://www.crowdsourcing.com, checked Apr 18th 2009. Reliability = Blog site of Jeff Howe, who coined the term Crowdsourcing. Site contains links and thoughts on articles in the news and feedback from speaking events.

n  Howe, Jeff, “The Rise of Crowdsourcing”, www.wired.com, 06-Sep. Reliability = Great. The original article where the term “Crowdsourcing” was born; talks about a few companies that are using it.

n  Frei, Brent “Paid Crowdsourcing: Current State & Progress toward Mainstream Business Use” www.marketwire.com 09/16/2009 Source = Decent Whitepaper on Crowdsourcing includes timelines of adoption as well as companies that are using it and how they are using it.

n  Hempel, Jessi “Crowdsourcing: Milking the Masses for Inspiration” www.businessweek.com 09/25/2006 Reliability = Good, Article talking about how to reign in the Crowdsourced Crowds.

n  Abrahamson, Shaun, “What do Crowds Get from Crowdsourcing” www.mutopo.com 04/12/2009 Reliability = Decent, Article about the motivation of Crowds in Crowdsourcing

n  Netflix “Frequently Asked Questions” www.netflixprize.com 10/01/2006 Reliability = Great, Official Website for Netflix Prize.

n  Copeland, Michael, “Box office boffo for brainiacs: The Netflix Prize” http://brainstormtech.blogs.fortune.cnn.com 09/21/2009 Reliability = Good, A brief news article about the winning Netflix Prize team and some statistics.

+ Bibliography (Continued)!

n  Charles, Dan, “Internet Users Join Search For Steve Fossett” www.npr.org 09.12.07 Reliability = Great, Article talking about how the internet search for Steve Fossett started and how it was sent out to the crowds

n  Barbalace, Kenneth, “Internet search for Steve Fossett eight weeks later” blog.environmentalchemistry.com 10/31/2007 Reliability = Decent, Blog Entry about the Internet Search for Steve Fossett and some future applications of the technology used.

n  National Academy of Public Administration, http://opengov.ideascale.com/ Sep 18th, 2009 Reliability = Good, The Website that was opened up for public to submit and vote on policy issues for President Obama

n  Hansell, Saul, "Ideas Online, Yes, but Some Not So Presidential" www.nytimes.com 06/22/2009 Reliability = Great, News Article Talking about Policy Issues Website and Results

n  Various Sources “Just Some Thoughts on the Contest” www.netflixprize.com 07/05/2009 Reliability = Good, Some feedback from the participants on why they thought the Netflix Prize was such a successful contest.

n  Waltner, Charles, “I-Prize Contest Proving a Winning Approach to Discovering Billion-Dollar Business Ideas” newsroom.cisco.com 07/14/2008, Reliability = Great, Information about what the I-prize is and a small amount of information on the winning team