Parse The Web Using Python+Beautiful Soup

Parse the web using Python + Beautiful Soup

張竟 at ncucccwebb(dot)tw(at)gmail(dot)com

Tuesday, May 26, 2009


Agenda

• Choosing the tools

• Introduction to Python

• Introduction to Beautiful Soup


Parse the web? But how?


Solutions

• C++

• Java

• Perl

• Python

• Others?


Solutions (Cont.)

• Direct string processing

• Regular expressions

• An existing parser
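The three approaches above can be compared on one tiny task: pulling the title out of a page. This is only a sketch. The `PAGE` string and the `TitleParser` class are made up for illustration, and the parser used is the standard library's `html.parser`, standing in for "an existing parser".

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, used only for this comparison.
PAGE = "<html><head><title>page title</title></head><body><p>hello</p></body></html>"

# 1. Direct string processing: locate the tag boundaries by hand.
start = PAGE.find("<title>") + len("<title>")
end = PAGE.find("</title>", start)
title_by_string = PAGE[start:end]

# 2. Regular expression: shorter, but still unaware of the HTML structure.
match = re.search(r"<title>(.*?)</title>", PAGE, re.IGNORECASE | re.DOTALL)
title_by_regex = match.group(1) if match else None


# 3. Existing parser: let the parser track which tag we are inside.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(PAGE)
parser.close()
title_by_parser = parser.title

print(title_by_string, title_by_regex, title_by_parser, sep=" | ")
```

All three agree on this clean input. The first two tend to break once tags carry attributes, odd casing, or line breaks, which is the usual argument for reaching for a parser.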


So I decided...


Python + Beautiful Soup


Python + Beautiful Soup (the first slide already gave it away)


Introduction to Python

• high-level programming language

• scripting language

• Rumored to be Google's favorite language


Features

• Variables do not need to be declared

• Indentation replaces {}

• Built-in list, tuple, and dictionary types


list

• a=['asdf',123,12.01,'abcd']

• a[3] (a[-1])

• #'abcd'

• a[0:2] (a[:2])

• #['asdf',123]

• b=['asdf',123,['qwer',12.34]]
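A small runnable sketch of the indexing and slicing shown on this slide. The variable names on the left-hand side are only for illustration.

```python
a = ['asdf', 123, 12.01, 'abcd']

# Index 3 and index -1 both refer to the last of the four items.
last_by_index = a[3]
last_by_negative = a[-1]

# A slice stops before its end index, so [0:2] holds items 0 and 1.
first_two = a[0:2]
first_two_short = a[:2]   # omitting the start means "from the beginning"

# Lists can hold other lists; index twice to reach the inner items.
b = ['asdf', 123, ['qwer', 12.34]]
inner_value = b[2][0]

print(last_by_index, first_two, inner_value)
```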


list (Cont.)

• a=['abc',12]

• len(a)

• #2

• a.append(1)

• #['abc',12,1]

• a.insert(1,'def')

• #['abc','def',12,1]


list (Cont.)

• a=[321,456,12,1]

• a.pop()

• #[321,456,12]

• a.index(12)

• #2

• a.sort()

• #[12,321,456]
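The list methods from the last two slides, run in sequence. This is a sketch that also captures what each call returns, which the slides leave implicit; `nums`, `popped`, and `position` are illustrative names.

```python
a = ['abc', 12]
length = len(a)        # number of items
a.append(1)            # add to the end, in place
a.insert(1, 'def')     # insert before index 1, in place

nums = [321, 456, 12, 1]
popped = nums.pop()          # removes and returns the last item
position = nums.index(12)    # index of the first matching item
nums.sort()                  # sorts in place and returns None

print(a, popped, position, nums)
```

Because `pop()` has already removed the `1`, the sorted list has three items, not four.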


tuple

• a=('asdf',123,12.01) or a='asdf',123,12.01

• a=(('abc',1),123.1)

• a,b=1,2
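A short sketch of what the tuple forms above do, including the unpacking in `a,b=1,2`. The names `same`, `nested`, `x`, and `y` are illustrative only.

```python
a = ('asdf', 123, 12.01)
same = 'asdf', 123, 12.01      # the parentheses are optional

nested = (('abc', 1), 123.1)   # tuples can contain tuples

# Unpacking assigns each item to one name.
x, y = 1, 2
x, y = y, x                    # the same mechanism swaps two values

# Nested unpacking follows the shape of the tuple.
(name, count), price = nested

# Unlike lists, tuples cannot be modified after creation.
try:
    a[0] = 'changed'
    immutable = False
except TypeError:
    immutable = True

print(a == same, x, y, name, count, price, immutable)
```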


Dictionary

• a={123:'abc','cde':456}

• a[123]

• #'abc'

• a['cde']

• #456
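A runnable sketch of the dictionary lookups above, plus a few everyday operations that the slide does not show: adding a key, testing membership, and `get` with a default. Those extras are additions for illustration.

```python
a = {123: 'abc', 'cde': 456}

# Keys can be of different types; lookup uses square brackets.
by_int_key = a[123]
by_str_key = a['cde']

# Assigning to a new key adds an entry.
a['new'] = 789

# "in" tests for a key; get() avoids a KeyError for missing keys.
has_cde = 'cde' in a
missing = a.get('nope', 'default')

print(by_int_key, by_str_key, has_cde, missing, len(a))
```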


if else

if a>10:
    print('a>10')
elif a<5:
    print('a<5')
else:
    print('5<=a<=10')

while loop

while a>2 or a<3:
    pass

for loop

a=['abc',123,'def']
for x in a:
    print(x)
#abc
#123
#def

for x in range(3):
    print(x)
#0
#1
#2

for x in range(4,34,10):
    print(x)
#4
#14
#24
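The control-flow forms on this slide, gathered into one runnable sketch. The `classify` function and the terminating `while` condition are made up for illustration; results are collected into variables so they can be inspected.

```python
def classify(a):
    """Return a label using if / elif / else (hypothetical helper)."""
    if a > 10:
        return 'a>10'
    elif a < 5:
        return 'a<5'
    else:
        return '5<=a<=10'


# while: repeat as long as the condition holds.
a = 3
steps = 0
while a < 10:
    a += 4
    steps += 1

# for: iterate directly over the items of a list.
seen = []
for x in ['abc', 123, 'def']:
    seen.append(x)

# range(stop) and range(start, stop, step) produce integer sequences.
zero_to_two = list(range(3))
stepped = list(range(4, 34, 10))

print(classify(3), classify(7), classify(11), a, steps, seen, zero_to_two, stepped)
```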


function

def fib(n):
    if n==0 or n==1:
        return n
    else:
        return fib(n-1)+fib(n-2)
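A sketch of defining and calling the function, assuming the intent is the usual Fibonacci recurrence where the two recursive results are added. `fib_iter` is a hypothetical loop-based variant added only for comparison; it is not from the slides.

```python
def fib(n):
    """Recursive Fibonacci: fib(0) = 0, fib(1) = 1."""
    if n == 0 or n == 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)


def fib_iter(n):
    """Same values computed with a loop (hypothetical alternative)."""
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous


first_ten = [fib(n) for n in range(10)]
print(first_ten, fib_iter(10))
```

The recursive form recomputes the same values many times, so the loop version is the more practical one for larger `n`.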


Finally, on to the main topic....


What is Beautiful Soup

• A Python module

• An HTML/XML parser

• Parses HTML/XML into a tree structure

• The tree can be searched and modified

not Beautiful Soap


Beautiful Soup

<html>
  <head>
    <title>page title</title>
  </head>
  <body>
    <p id="firstpara" align="center">first paragraph <b>one</b></p>
    <p id="secondpara">second paragraph <b>two</b></p>
  </body>
</html>

Basic operations

from BeautifulSoup import BeautifulSoup
soup=BeautifulSoup(page)

soup.html.head
#<head><title>page title</title></head>

soup.head
#<head><title>page title</title></head>

soup.body.p
#<p id="firstpara" align="center">first paragraph <b>one</b></p>

See urllib/urllib2 for how to open a URL in Python.
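A minimal sketch of fetching a page to use as `page`. It assumes Python 3, where the functionality of urllib2 lives in `urllib.request`. The `fetch` helper is hypothetical, and a `data:` URL is used so the example runs without network access; a real script would pass an `http://` address instead.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch(url, encoding="utf-8"):
    """Open a URL and return the response body as text (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Sample markup wrapped in a data: URL, purely so this runs offline.
html = "<html><head><title>page title</title></head></html>"
demo_url = "data:text/html," + quote(html)

page = fetch(demo_url)
print(page)
```

The resulting string is what would be handed to the parser as `page`.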

Basic operations (Cont.)

• parent (go to parent node)

soup.title.parent == soup.head

• next (go to next node)

soup.title.next == 'page title'
soup.title.next.next == soup.body

• previous (go to previous node)

soup.title.previous == soup.head
soup.body.p.previous == soup.body


Basic operations (Cont.)

• contents (all content nodes)

soup.html.contents == [soup.html.head, soup.html.body]

• nextSibling (go to next sibling)

soup.html.body.p.nextSibling == soup.html.body.contents[1]

• previousSibling (go to previous sibling)

soup.html.body.previousSibling == soup.html.head


Basic operations (Cont.)

• Tag name

soup.html.body.name == 'body'

• String output

soup.html.head.title.string == soup.html.head.title.contents[0] == 'page title'
str(soup.html.head.title) == '<title>page title</title>'

• Tag attributes

soup.html.body.p.attrMap == {'align' : 'center', 'id' : 'firstpara'}

soup.html.body.p['id'] == 'firstpara'
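To make the tree idea behind these operations concrete, here is a sketch that builds a small tree with the standard library's `html.parser`. It is not Beautiful Soup and does not use its API: `Node`, `TreeBuilder`, and their methods are hypothetical helpers that only mimic parent, contents, sibling, name, and attribute access. It assumes well-formed input with no whitespace between tags.

```python
from html.parser import HTMLParser


class Node:
    """One tag in the tree (hypothetical stand-in for a parsed tag)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.contents = []  # child Nodes and text strings, in document order

    def child(self, name):
        """Return the first direct child tag called `name`, or None."""
        for item in self.contents:
            if isinstance(item, Node) and item.name == name:
                return item
        return None

    def _sibling(self, offset):
        if self.parent is None:
            return None
        siblings = self.parent.contents
        for i, item in enumerate(siblings):
            if item is self:
                j = i + offset
                return siblings[j] if 0 <= j < len(siblings) else None
        return None

    def next_sibling(self):
        return self._sibling(1)

    def previous_sibling(self):
        return self._sibling(-1)


class TreeBuilder(HTMLParser):
    """Builds Nodes from start/end tag events; assumes tags are balanced."""

    def __init__(self):
        super().__init__()
        self.root = Node('[document]')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.contents.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open tag, then step up to its parent.
        node = self.current
        while node is not None and node.name != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.contents.append(data)


PAGE = ('<html><head><title>page title</title></head>'
        '<body><p id="firstpara" align="center">first paragraph <b>one</b></p>'
        '<p id="secondpara">second paragraph <b>two</b></p></body></html>')

builder = TreeBuilder()
builder.feed(PAGE)
builder.close()

html = builder.root.child('html')
head, body = html.child('head'), html.child('body')
title = head.child('title')
first_p = body.child('p')

print(title.parent.name, body.name, title.contents, first_p.attrs)
```

Each slide's comparison has a counterpart here: `title.parent is head`, `html.contents == [head, body]`, `first_p.next_sibling()` is the second paragraph, and `first_p.attrs['id']` is `'firstpara'`.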


Search

• find(name, attrs, recursive, text)

• name: tag name

• attrs: tag attributes

• recursive: whether to search recursively

• text: tag content

find(name, attrs, recursive, text)

soup.find('p') == soup.html.body.p
#<p id="firstpara" align="center">first paragraph <b>one</b></p>

soup.find('p', id='secondpara')
#<p id="secondpara">second paragraph <b>two</b></p>

soup.find('p', recursive=False) == None

soup.find(text='one') == soup.b.next

findAll(name, attrs, recursive, text, limit)

soup.findAll('p') == [soup.html.body.p, soup.p.nextSibling]

soup.findAll('p', id='secondpara')
#[<p id="secondpara">second paragraph <b>two</b></p>]

soup.findAll('p', recursive=False) == []

soup.findAll(text='one') == [soup.b.next]

soup.findAll(limit=4) == [soup.html, soup.html.head, soup.html.head.title, soup.html.body]
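The same kinds of search can be tried with only the standard library, using `xml.etree.ElementTree` as a stand-in. This is not Beautiful Soup's API, and it works here only because the sample page happens to be well-formed XML; real-world HTML usually is not. All variable names are illustrative.

```python
import itertools
import xml.etree.ElementTree as ET

PAGE = ('<html><head><title>page title</title></head>'
        '<body><p id="firstpara" align="center">first paragraph <b>one</b></p>'
        '<p id="secondpara">second paragraph <b>two</b></p></body></html>')

root = ET.fromstring(PAGE)  # root is the <html> element

# By tag name, anywhere below the root (like a recursive search).
first_p = root.find('.//p')
all_p = root.findall('.//p')

# By tag name plus attribute value.
second_p = root.find(".//p[@id='secondpara']")

# Direct children only (like recursive=False): <p> is not a child of <html>.
direct_p = root.find('p')

# By text content: elements whose leading text is exactly 'one'.
text_hits = [element for element in root.iter() if element.text == 'one']

# First four tags in document order (like limit=4).
first_four = [element.tag for element in itertools.islice(root.iter(), 4)]

print(first_p.get('id'), [p.get('id') for p in all_p], first_four)
```

The document-order listing gives `html`, `head`, `title`, `body`, which is the order a limited search walks the tree in.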

Other solutions

• lxml

• html5lib

• HTMLParser

• htmlfill

• Genshi

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/




Reference

• Python official website
http://www.python.com/ (>///< not suitable for children)
http://www.python.org/

• Beautiful Soup documentation
http://www.crummy.com/software/BeautifulSoup/

• Personal blog
http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/

• Python HTML parser performance
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

