Web Scraper Shibuya.pm tech talk #8
Practical Web Scraping
with Web::Scraper
Tatsuhiko Miyagawa
Six Apart, Ltd. / Shibuya Perl Mongers
Shibuya.pm Tech Talks #8
Tatsuhiko Miyagawa 2007/10/01 Shibuya.pm Tech Talk #8
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.
http://en.wikipedia.org/wiki/Screen_scraping
"Screen-scraping is so 1999!"
RSS is metadata,
not a complete HTML replacement
Practical Web Scraping
with Web::Scraper
What's wrong with LWP & Regexp?
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
It works!
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
It works …
There are 3 problems (at least)
(1) Fragile
Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
(2) Hard to maintain
Regular-expression-based scrapers are good only when they're used in write-only scripts
(3) Improper HTML & encoding handling
<span class="message">I &hearts; Shibuya</span>
> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Shibuya
<span class="message">I &hearts; Shibuya</span>
> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Shibuya
<span class="message">Perl が大好き! </span>
> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
Perl が大好き!
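The "Wide character in print" warning comes from print rather than from the decoding step: decode_utf8 returns Perl characters, and STDOUT has no encoding layer by default. A minimal sketch of one way to avoid it, assuming the fetched bytes are UTF-8 and the terminal expects UTF-8. `$bytes` is a made-up stand-in for the downloaded page, and the regexp is kept only to stay close to the one-liner above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use HTML::Entities qw(decode_entities);

# Stand-in for raw bytes fetched over HTTP (assumed to be UTF-8 encoded).
my $bytes = encode_utf8(
    qq{<span class="message">Perl \x{304C}\x{5927}\x{597D}\x{304D}! &hearts;</span>}
);

# 1. Decode the bytes into Perl characters first.
my $html = decode_utf8($bytes);

# 2. Extract the text, then decode HTML entities on the character string.
my ($message) = $html =~ m@<span class="message">(.*?)</span>@;
$message = decode_entities($message);

# 3. Encode on the way out, so print does not see "wide" characters.
binmode STDOUT, ':encoding(UTF-8)';
print "$message\n";
```

The order is the point here: bytes become characters on input, all processing happens on characters, and characters become bytes again on output.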
The "right" way of screen-scraping
(1), (2) Maintainable, less fragile
Use XPath and CSS Selectors
XPath
HTML::TreeBuilder::XPath
XML::LibXML
XPath
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
# Monday, August 27, 2007 at 12:49:46
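To make the "less fragile" claim concrete, here is a small sketch that runs the earlier regexp and the XPath query against two variants of the same markup. Both HTML strings are made up for illustration (the second adds a class attribute and a line break inside the tag), and `@results` is just a helper for collecting what each approach finds:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Two hypothetical variants of the same markup.
my @variants = (
    q{<html><body><p>time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p></body></html>},
    qq{<html><body><p>time used: <strong class="big"\n id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p></body></html>},
);

my @results;
for my $html (@variants) {
    # Regexp from the one-liner: depends on the exact tag text.
    my ($by_regexp) = $html =~ m@<strong id="ctu">(.*?)</strong>@;

    # XPath on the parsed tree: depends only on element name and id.
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my ($node) = $tree->findnodes('//strong[@id="ctu"]');
    my $by_xpath = $node ? $node->as_text : undef;
    $tree->delete;    # free the parsed tree

    push @results, { regexp => $by_regexp, xpath => $by_xpath };
    printf "regexp: %s\nxpath:  %s\n",
        $by_regexp // '(no match)', $by_xpath // '(no match)';
}
```

The regexp stops matching on the second variant, while the XPath query still finds the element.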
CSS Selectors
"XPath for HTML coders"
"XPath for people who hate XML"
CSS Selectors
body { font-size: 12px; }
div.article { padding: 1em }
span#count { color: #fff }
XPath: //strong[@id="ctu"]
CSS Selector: strong#ctu
CSS Selectors
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;
# Monday, August 27, 2007 at 12:49:46
Complete Script
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;
Robust, maintainable,
and sane character handling
Example (before)
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
Example (after)
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;
but … long and boring
Practical Web Scraping
with Web::Scraper
Web scraping toolkit
inspired by scrapi.rb
DSL-ish
Example (before)
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;
Example (after)
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;
my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};
my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);
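The same scraper can also be tried without network access. The sketch below passes a reference to an HTML string to scrape() instead of a URI object; the `$html` content is a made-up copy of the snippet shown earlier, not the live page:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical local copy of the relevant markup.
my $html = <<'HTML';
<html><body>
<p>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p>
</body></html>
HTML

my $s = scraper {
    # CSS selector, key in the result stash, and what to extract.
    process "strong#ctu", time => 'TEXT';
    # Return only the value stored under 'time'.
    result 'time';
};

# scrape() accepts a scalar reference to HTML as well as a URI object.
my $time = $s->scrape(\$html);
print "$time\n";
```

Scraping a saved string like this is handy for tests, since the selector can be checked without depending on the remote site.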
Basics
use Web::Scraper;
my $s = scraper {
    # DSL goes here
};
my $res = $s->scrape($uri);
process
process $selector,
    $key => $what,
    …;
$selector:
CSS Selector
or XPath (starts with /)
$key:
key for the result hash
append "[]" for looping
$what:
'@attr'
'TEXT'
'RAW'
Web::Scraper
sub { … }
Hash reference
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
process "ul.sites > li > a",
    'urls[]' => '@href';
# { urls => [ … ] }
process '//ul[@class="sites"]/li/a',
    'names[]' => 'TEXT';
# { names => [ 'OpenGuides', … ] }
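The CSS and XPath forms of the selector can be compared side by side. This is a sketch on an inline copy of the list above; the key names `by_css` and `by_xpath` are invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<html><body>
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
</body></html>
HTML

my $s = scraper {
    # A selector starting with "/" is treated as XPath, anything else as CSS.
    process "ul.sites > li > a",         'by_css[]'   => 'TEXT';
    process '//ul[@class="sites"]/li/a', 'by_xpath[]' => 'TEXT';
};

# Without a result call, scrape() returns the whole stash as a hashref.
my $res = $s->scrape(\$html);
print "CSS:   @{ $res->{by_css} }\n";
print "XPath: @{ $res->{by_xpath} }\n";
```

Both keys end with "[]", so each one collects an array reference with one entry per matched link.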
process "ul.sites > li",
    'sites[]' => scraper {
        process 'a',
            link => '@href', name => 'TEXT';
    };
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
process "ul.sites > li > a",
    'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
process "ul.sites > li > a",
    'sites[]' => {
        link => '@href', name => 'TEXT',
    };
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
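The three ways of building a list of hashes (nested scraper, callback, hash reference) can be run against the same markup to confirm they produce the same structure. A sketch on an inline copy of the list; the variable names `$nested`, `$callback`, `$hashref` and `@results` are made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<html><body>
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
</body></html>
HTML

# Form 1: a nested scraper runs once per matched <li>.
my $nested = scraper {
    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
};

# Form 2: a callback receives each matched element in $_.
my $callback = scraper {
    process "ul.sites > li > a", 'sites[]' => sub {
        +{ link => $_->attr('href'), name => $_->as_text };
    };
};

# Form 3: a hash reference maps keys to extraction rules.
my $hashref = scraper {
    process "ul.sites > li > a", 'sites[]' => {
        link => '@href', name => 'TEXT',
    };
};

my @results;
for my $s ($nested, $callback, $hashref) {
    my $res = $s->scrape(\$html);
    push @results, $res;
    print "$_->{name} => $_->{link}\n" for @{ $res->{sites} };
}
```

The hash-reference form is the shortest when every field comes from the same element; the nested scraper is the one to reach for when each item needs its own selectors.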
result
result;        # get stash as hashref (default)
result @keys;  # get stash as hashref containing @keys
result $key;   # get value of stash $key
my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
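A sketch of the three result forms on a small made-up page. The HTML, the key names (`title`, `heading`, `lead`) and the variable names are all invented for illustration; note that result is written as the last statement of each block:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<html><head><title>Demo page</title></head>
<body><h1>Heading</h1><p class="lead">Intro text</p></body></html>
HTML

# No arguments: the whole stash comes back as a hashref.
my $all = scraper {
    process 'title',  title   => 'TEXT';
    process 'h1',     heading => 'TEXT';
    process 'p.lead', lead    => 'TEXT';
    result;
};

# Several keys: a hashref containing only those keys.
my $some = scraper {
    process 'title',  title   => 'TEXT';
    process 'h1',     heading => 'TEXT';
    process 'p.lead', lead    => 'TEXT';
    result 'title', 'heading';
};

# One key: the bare value stored under that key.
my $one = scraper {
    process 'title', title => 'TEXT';
    result 'title';
};

my $r_all  = $all->scrape(\$html);
my $r_some = $some->scrape(\$html);
my $r_one  = $one->scrape(\$html);

print join(', ', sort keys %$r_all),  "\n";
print join(', ', sort keys %$r_some), "\n";
print "$r_one\n";
```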
Live Demo
Tools
> cpan Web::Scraper
comes with 'scraper' CLI
> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
    links => [
        'http://example.org/',
        'http://example.net/',
    ],
};
scraper> y
---
links:
- http://example.org/
- http://example.net/
> scraper /path/to/foo.html
> GET http://example.com/ | scraper
Recent Updates
0.13
'c' and 'c all'
WARN in scraper
0.14
automatic absolute URI for link elements
(a@href, img@src)
0.14 (cont.)
'RAW' and 'HTML'
0.15
$Web::Scraper::UserAgent
$scraper->user_agent
0.19
support encoding detection w/ META tags
TODO
Web::Scraper needs documentation
More examples to put in eg/ directory
Alternative API inspired by scRUBYt!
OO Backend API
if you don't like the DSL
integrate with WWW::Mechanize
and Test::WWW::Declare
XPath Auto-suggestion
off of DOM + element
DOM + XPath => Element
DOM + Element => XPath?
(Template::Extract?)
generic XML support (e.g. RSS/Atom feeds)
extensible text filter
date, geo, hCards (microformats)
<span class="entry-date">October 1st, 2007 17:13:31 +0900</span>
process ".entry-date", date => 'TEXT:rfc822';
Summary
Web::Scraper
inspired by scrapi
easy, fun, maintainable & less fragile
CSS selector
XPath
Questions?
Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper