Dive into Scrapy - files.meetup.com/13310742/pythonmadrid.pdf
Dive into Scrapy
Juan Riaza (@juanriaza)
CHAPTER 1
- THE FANTABULOUS WORLD OF DATA -
Sources of Data
RSS
Email
Internet
Documents
APIs
Tradeoffs
Most of the world hasn't embraced API-centric development
Most of the world's interesting data isn't API accessible
APIs Tradeoffs
Throttling
Limited Data
Availability
They know you
tl;dr: The web is thoroughly broken
Web Scraping
“is a computer software technique of extracting information from websites”
CHAPTER 2
- BASIC TOOLSET FOR THE CURIOUS -
HTTP
Methods: GET, POST, PUT, HEAD…
Headers, Query String: Accept-language, UA*…
Status Codes: 2XX, 3XX, 4XX, 418, 5XX, 999
Persistence: Cookies
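As a rough illustration of the pieces named on this slide, the sketch below builds (but does not send) a GET request with a query string and custom headers, using only the standard library. The URL, query parameters, and header values are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, chosen only for illustration.
base_url = "http://www.example.com/search"
query = urlencode({"q": "scrapy", "page": 2})

# Headers a scraper commonly sets: language preference and user agent.
req = Request(
    base_url + "?" + query,
    headers={
        "Accept-Language": "en",
        "User-Agent": "my-crawler/0.1",
    },
)

print(req.get_method())    # GET, because no request body was given
print(req.full_url)        # base URL plus the encoded query string
print(req.header_items())  # the headers that would be sent
```

Nothing touches the network here, so the request can be inspected safely before pointing it at a real site.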
Developer Tools
Emulate mobile devices
Network Inspector
Resources
Search XPATH
Elements, Cookies
Filter by XHR
Mobile sites
Extensions: Hola, JS Switch…
HTTP Libraries
urllib2 (stdlib)
requests-oauthlib
python-requests
requestb.in
HTML is not a regular language
HTML Parsers
lxml: Pythonic binding for the C libraries libxml2 and libxslt
BeautifulSoup: works on top of html.parser, lxml, or html5lib
Those who don't understand xpath are cursed to reinvent it, poorly.
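To give a feel for XPath without installing anything, here is a minimal sketch using the standard library's ElementTree. Note the assumptions: ElementTree supports only a subset of XPath and needs well-formed markup, so it stands in for lxml here, and the table contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed table standing in for a real schedule page.
markup = """
<table>
  <tr><td><a href="/e/1">Opening</a></td><td><a href="/s/1">Alice</a></td></tr>
  <tr><td><a href="/e/2">Closing</a></td><td><a href="/s/2">Bob</a></td></tr>
</table>
"""

root = ET.fromstring(markup)

rows = []
for tr in root.findall(".//tr"):
    # td[1] and td[2] are positional predicates, as in full XPath.
    event = tr.find("./td[1]/a").text
    speaker = tr.find("./td[2]/a").text
    rows.append((event, speaker))

print(rows)
```

One path expression per field replaces what would otherwise be hand-written tree walking.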
    # -*- coding: utf-8 -*-
    import requests
    import lxml.html

    req = requests.get('https://fosdem.org/2015/schedule/events/')
    tree = lxml.html.fromstring(req.text)
    for tr in tree.xpath('//tr'):
        content = tr.xpath('./td[1]/a/text()')
        name = tr.xpath('./td[2]/a/text()')
CHAPTER 3
- TOOLSET FOR THE ADVENTUROUS -
Scrapy-ify early on
Maybe you'll need multiple HTTP requests.
Maybe you'll just want testable code.
“An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
Healthy Community
6.3k ★, 1.6k forks, 500 watchers
@scrapyproject: 1.6k followers
2.7k questions
2k members on mailing list
Start a project
    $ scrapy startproject <name>

    fosdem
    ├── fosdem
    │   ├── __init__.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/',
        ]

        def parse(self, response):
            self.log('A response from %s just arrived!' % response.url)
Spiders
Generate the initial Requests: start_urls, start_requests()
In the callback function, you parse the response and return either Item objects, Request objects, or an iterable of both
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/'
        ]

        def parse(self, response):
            # Yield one item per <h3> heading on the page
            for h3 in response.xpath('//h3/text()').extract():
                yield {'title': h3}

            # Follow every link; urljoin resolves relative hrefs
            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(response.urljoin(url), callback=self.parse)
Interactive Shell
Invaluable tool for developing and debugging your spiders
$ scrapy shell <url>
IPython
Invoking the shell from spiders to inspect responses (scrapy.shell.inspect_response)
Available Scrapy Objects: spider, request, sel…
Available Shortcuts: shelp(), fetch(), view()
Avoid getting banned
Rotate your user agent
Disable cookies
Download delays
Use a pool of rotating IPs
Crawlera
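One way the first tip could be sketched is a small downloader middleware that picks a user agent per request. This is an illustration under assumptions, not the talk's own code: the class name, the user agent strings, and the module path in the comments are hypothetical. The `process_request(request, spider)` hook and the settings named in the comments are standard Scrapy.

```python
import random

# Hypothetical user agent strings; substitute real browser strings.
USER_AGENTS = [
    "example-agent/1.0",
    "example-agent/2.0",
    "example-agent/3.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a user agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Scrapy calls this hook for each outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to continue handling the request


# Related settings.py entries (the module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {"fosdem.middlewares.RotateUserAgentMiddleware": 400}
# COOKIES_ENABLED = False   # disable cookies
# DOWNLOAD_DELAY = 2        # seconds to wait between requests
```

The class does not import Scrapy, so it can be exercised in isolation with any object that exposes a `headers` mapping.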
Everything else
Items, ItemLoaders, Middlewares, Pipelines, Stats
Feed Exports: JSON, CSV, XML, DjangoItem, S3…
Testing: Contracts
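Of the components listed here, pipelines are perhaps the quickest to show. The sketch below is a hypothetical item pipeline (the class name, the `title` field, and the module path in the comment are assumptions) built on Scrapy's `process_item(item, spider)` hook.

```python
class CleanTitlePipeline:
    """Item pipeline that tidies the 'title' field of scraped items."""

    def process_item(self, item, spider):
        # Scrapy passes every scraped item through this method.
        title = item.get("title")
        if title is not None:
            # Collapse runs of whitespace left over from the HTML.
            item["title"] = " ".join(title.split())
        return item  # returning the item hands it to the next pipeline


# Enabled in settings.py (module path is hypothetical):
# ITEM_PIPELINES = {"fosdem.pipelines.CleanTitlePipeline": 300}
```

Because the pipeline only relies on dict-style access, it can be unit tested with plain dictionaries, which fits the "testable code" argument made earlier in the talk.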
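DjangoItem

    from django.db import models

    class Person(models.Model):
        name = models.CharField(max_length=255)
        age = models.IntegerField()

    # Newer Scrapy releases ship DjangoItem separately, in the
    # scrapy-djangoitem package (from scrapy_djangoitem import DjangoItem).
    from scrapy.contrib.djangoitem import DjangoItem

    class PersonItem(DjangoItem):
        django_model = Person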
scrapinghub/pycon-speakers
CHAPTER 4
- DEPLOYMENT -
Scrapyd
Provides a JSON web service to upload new project versions (as eggs) and schedule spiders
$ scrapy deploy
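Scheduling a spider through that JSON web service might look roughly like the sketch below, which builds (without sending) a POST to Scrapyd's `schedule.json` endpoint. It assumes Scrapyd is listening on its default port 6800 on localhost; the project and spider names are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumes Scrapyd on its default port on localhost; the project and
# spider names used below are placeholders.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider):
    """Build (without sending) a POST to Scrapyd's schedule.json endpoint."""
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(SCRAPYD_URL + "/schedule.json", data=body, method="POST")


req = build_schedule_request("fosdem", "events")
print(req.get_method(), req.full_url, req.data)
# To actually send it, pass req to urllib.request.urlopen();
# Scrapyd answers with a JSON document.
```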
Scrapy Cloud
Scrapy Cloud, our platform as a service offering, allows you to easily build crawlers, deploy them instantly and scale them on demand. Watch your Scrapy spiders as they run and collect data, and review their data through our beautiful frontend.
CHAPTER 5
- ABOUT US -
TONS of Open Source
Mandatory Sales Slide
Professional Services
Products: Scrapy Cloud, Crawlera
We’re hiring!