Web Scraping with Python
Miguel Miranda de Mattos: @mmmattos - mmmattos.net
Porto Alegre, Brazil.
2012
Web Scraping with Python
● Tools:
○ BeautifulSoup
○ Mechanize
BeautifulSoup
An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
● In Summary:
○ Navigate the "soup" of HTML/XML tags programmatically
○ Access a tag's properties and values
○ Search for tags and their attributes.
BeautifulSoup
○ Example:
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()
# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>
BeautifulSoup
○ Searching / Looking for things
■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings'
■ findAll
● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
● Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
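To make the meaning of the name, attrs, recursive and limit arguments concrete, here is a minimal sketch of the same kind of search written against the standard library's xml.etree.ElementTree. It is not BeautifulSoup code: find_all is a hypothetical helper, the sample table is invented, and unlike BeautifulSoup it only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET


def find_all(root, name=None, attrs=None, recursive=True, limit=None):
    """Return elements under `root` matching a tag name and attribute values.

    Hypothetical helper that mirrors the meaning of findAll's main
    arguments; it is not BeautifulSoup code and needs well-formed markup.
    """
    attrs = attrs or {}
    # recursive=True walks all descendants; False looks at direct children only.
    candidates = root.iter() if recursive else iter(root)
    matches = []
    for element in candidates:
        if element is root:
            continue  # iter() yields the root itself first; skip it
        if name is not None and element.tag != name:
            continue
        if any(element.get(key) != value for key, value in attrs.items()):
            continue
        matches.append(element)
        if limit is not None and len(matches) >= limit:
            break
    return matches


# Invented sample document for the sketch.
doc = ET.fromstring(
    "<table>"
    "<tr><td class='num'>one</td><td>two</td></tr>"
    "<tr><td class='num'>three</td></tr>"
    "</table>"
)

all_cells = [td.text for td in find_all(doc, "td")]
num_cells = [td.text for td in find_all(doc, "td", attrs={"class": "num"})]
first_cell = [td.text for td in find_all(doc, "td", limit=1)]
direct_rows = [el.tag for el in find_all(doc, recursive=False)]
```

Each argument narrows the result: the name filters by tag, attrs by attribute values, recursive=False restricts the search to direct children, and limit stops after the given number of matches.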
BeautifulSoup
● Example:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)
>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]
>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]
BeautifulSoup
● findAll (cont'd.):
>>> for t in docSoup.findAll('td'):
...     print t
...
<td>one</td>
<td>two</td>
>>> for t in docSoup.findAll('td'):
...     print t.getText()
...
one
two
BeautifulSoup
● findAll using attributes to qualify:
>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]
● For more options:
○ dir(BeautifulSoup)
○ help(yourSoup.<command>)
● Use BeautifulSoup rather than regexp patterns. Instead of:
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)
○ write:
soup = BeautifulSoup(html)
for tag in soup.findAll('a'):
    print tag['title']  # raises KeyError for an <a> without a title attribute
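A small sketch of why a parser is the safer choice. It runs the slide's pattern and a parser over the same markup; the parser is the standard library's html.parser, used as a stand-in for BeautifulSoup, and TitleCollector and the sample links are invented for the illustration.

```python
import re
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with quotes already removed.
        title = dict(attrs).get("title")
        if title is not None:
            self.titles.append(title)


# Invented sample: double quotes, single quotes, and a link with no title.
html = (
    '<a href="/a" title="First">one</a>'
    "<a title='Second' href='/b'>two</a>"
    '<a href="/c">no title</a>'
)

# The pattern from the slide only recognises double-quoted attributes.
pattern = re.compile(r'<a[^>]*\stitle="(.*?)"')
regex_titles = pattern.findall(html)

collector = TitleCollector()
collector.feed(html)
collector.close()
parser_titles = collector.titles
```

On this input the pattern finds only "First", because the second link quotes its attributes with single quotes; the parser returns both titles and skips the link that has none.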
Mechanize
● Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.
● mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
○ any URL can be opened, not just http:
○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
● Easy HTML form filling.
● Convenient link parsing and following.
● Browser history (.back() and .reload() methods).
● The Referer HTTP header is added properly (optional).
● Automatic observance of robots.txt.
● Automatic handling of HTTP-Equiv and Refresh.
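As an illustration of what observing robots.txt involves, the check can be sketched with the standard library's urllib.robotparser. This is not mechanize code; the robots.txt rules, the "example-bot" agent name and the URLs are assumptions made up for the example, and the rules are parsed from memory so no request is sent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so no request is made.
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# Ask whether a given user agent may fetch each URL under these rules.
allowed = parser.can_fetch("example-bot", "http://www.example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://www.example.com/private/data.html")
```

A browser that observes robots.txt automatically performs a check of this kind before each request and refuses to fetch the disallowed URL.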
Mechanize
● Navigation commands:○ open(url)
○ follow_link(link)
○ back()
○ submit()
○ reload()
● Examples
import mechanize

br = mechanize.Browser()
br.open("http://www.python.org/")  # open() needs an absolute URL, scheme included
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
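What "stateful" means for the navigation commands listed above can be sketched with a toy class that keeps the current page and a history stack. ToyBrowser is hypothetical and only models navigation state; it does not reflect mechanize's internals, and a dictionary stands in for the network.

```python
class ToyBrowser:
    """Hypothetical stand-in that models only the navigation state."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: url -> page body
        self._history = []    # (url, body) pairs of earlier pages, oldest first
        self.url = None       # URL of the current page
        self.body = None      # body of the current page

    def open(self, url):
        # Remember the page being left so back() can return to it.
        if self.url is not None:
            self._history.append((self.url, self.body))
        self.url, self.body = url, self._fetch(url)
        return self.body

    def back(self):
        if not self._history:
            raise RuntimeError("no earlier page in the history")
        self.url, self.body = self._history.pop()
        return self.body

    def reload(self):
        if self.url is None:
            raise RuntimeError("no page is open")
        self.body = self._fetch(self.url)
        return self.body


# A dictionary plays the role of the network in this sketch.
pages = {"/": "home", "/docs": "documentation"}
browser = ToyBrowser(pages.__getitem__)

browser.open("/")
browser.open("/docs")
page_after_back = browser.back()
```

After the two open() calls, back() returns to the first page without the caller having to track URLs, which is the bookkeeping a stateful browser object takes over.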
Mechanize
● Example:
import re
import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
# (nr is a zero-based index, so nr=1 selects the second match)
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()  # headers
print response1.read()  # body
Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    # print the class attribute only for divs that have one
    if d.has_key('class'):
        print d['class']