SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd...
-
date post
15-Jan-2016 -
Category
Documents
-
view
214 -
download
0
Transcript of SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd...
![Page 1: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/1.jpg)
SLASHPack: CollectorPerformance Improvement andEvaluation
Rudd Stevens
CS 690
Spring 2006
![Page 2: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/2.jpg)
SLASHPack Collector - 5/4/2006 2
Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.
![Page 3: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/3.jpg)
SLASHPack Collector - 5/4/2006 3
Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.
![Page 4: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/4.jpg)
SLASHPack Collector - 5/4/2006 4
Introduction
SLASHPack Toolkit(Semi-LArge Scale Hypertext Package)
Sponsored by Prof. Chris Brooks, engineered for initial clients Nancy Montanez and Ryan King.
Collector component Framework for collecting documents.
Evaluate and improve performance.
![Page 5: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/5.jpg)
SLASHPack Collector - 5/4/2006 5
Contact and Information Sources Contact Information:
Rudd Stevens
rstevens (at) cs.usfca.edu
Project Website: http://www.cs.usfca.edu/~rstevens/slashpack/collector/
Project Sponsor: Professor Christopher Brooks
Department of Computer Science
University of San Francisco
cbrooks (at) cs.usfca.edu
![Page 6: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/6.jpg)
SLASHPack Collector - 5/4/2006 6
Stages Addition of protocol module for Weblog data set.
Performance testing using the Weblog and HTTP modules. Identify problem areas.
Modify Collector to improve scalability and performance.
Repeat performance testing and evaluate performance improvements.
![Page 7: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/7.jpg)
SLASHPack Collector - 5/4/2006 7
Implementation Language: Python
Platform: Any Python supported OS. Python 2.4 or later (Developed and tested under Linux.)
Progress: Fully built, newly re-factored for performance and usability.
![Page 8: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/8.jpg)
SLASHPack Collector - 5/4/2006 8
High level design
SLASHPack designed as a framework.
Modular components, that contain sub-modules.
Collector pluggable for protocol modules, parsers, filters, output writers, etc.
![Page 9: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/9.jpg)
SLASHPack Collector - 5/4/2006 9
High level design (cont.)
![Page 10: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/10.jpg)
SLASHPack Collector - 5/4/2006 10
Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.
![Page 11: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/11.jpg)
SLASHPack Collector - 5/4/2006 11
Performance Testing Large scale text collection.
Weblog data set.Long web crawls.
Performance testing monitoring Python Profiling.Integrated Statistics.
Functionality TestingPython logging.Functionality test runs.
![Page 12: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/12.jpg)
SLASHPack Collector - 5/4/2006 12
Collector Runtime StatisticsUrlFrontier
Url Frontier size,
current number of links: 3465
Urls requested from frontier: 659
Url Frontier,
current number of server queues: 78
Urls delivered from frontier: 639
Collector
Documents per second:
3.70328865405
Total runtime:
2 Minutes 31.4869761467 Seconds
UrlSentry
Urls filtered using robots: 38
Urls filtered for depth: 9
Urls Processed: 5881
Urls filtered using filters: 165
UrlBookkeeper
Duplicate Urls: 1557
Urls recorded: 4104
![Page 13: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/13.jpg)
SLASHPack Collector - 5/4/2006 13
Collector Runtime StatisticsDocFingerprinter
Documents Written: 386
Average Document Size (bytes):
20570
HTTP Status Responses:
200: 394 204: 10
301: 8 302: 25
404: 91 403: 7
401: 1 400: 24
500: 1
Duplicate Documents: 51
Total Documents Collected: 561
Documents by mimetype:
text/xml: 1 image/jpeg: 1
text/html: 451 image/gif: 1
text/plain: 106
application/octet-stream: 1
![Page 14: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/14.jpg)
SLASHPack Collector - 5/4/2006 14
Challenges Large text (XML) files
21 1 GB XML files. ~450,000 files per XML file.~10 Million files, after processing.
Memory/StorageDisk space.Memory usage during processing. (XML)
![Page 15: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/15.jpg)
SLASHPack Collector - 5/4/2006 15
Weblog raw data <post>
<weblog_url> http://www.livejournal.com/users/chuckdarwin </weblog_url> <weblog_title> ""Evolve!"“ </weblog_title> <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink> <post_title> Flickr </post_title> <author_name> Darwin (chuckdarwin) </author_name> <date_posted> 2005-07-09 </date_posted> <time_posted> 000000 </time_posted> <content> <html><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><title>""Evolve!""</title></head><body><div style="text-align: center;"><font size="+1"><a href="http://www.nytimes.com/2005/07/09/arts/09boxe.html?ei=5088&amp;en=61cfcd5835008b1a&amp;ex=1278561600&amp;partner=rssnyt&amp;emc=rss&amp;pagewanted=print">7/7 and 9/11?</a></font></div></body></html></content><outlinks>
<outlink> <url> http://www.nytimes.com/2005/07/09/arts/09boxe.html </url> <site> http://www.nytimes.com </site> <type> Press </type></outlink>
</outlinks> </post>
![Page 16: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/16.jpg)
SLASHPack Collector - 5/4/2006 16
Weblog processed data<spdata>
<url>http://www.livejournal.com/users/chuckdarwin</url><date>20060212</date><crawlname>WeblogPosts20050709</crawlname>
<weblog> <weblog_title>”"Evolve!""</weblog_title> <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink> <post_title>Flickr</post_title> <author_name>Darwin (chuckdarwin)</author_name> <date_posted>2005-07-09</date_posted>
<time_posted>000000</time_posted> <outlinks> <outlink> <type>Press</type>
<url>http://www.nytimes.com/2005/07/09/arts/09boxe.html</url><site>http://www.nytimes.com</site>
</outlink> </outlinks>
</weblog> <tags></tags> <size>493</size> <mimetype>text/plain</mimetype> <fingerprint>9949bba4ac535d18c3f11db66cdb194e</fingerprint> <content>Jmx0O2h0bWwmZ3Q7CiZsdDtoZWFkJmd0OwombHQ7bWV0YSBjb250ZW50P
….</content></spdata>
![Page 17: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/17.jpg)
SLASHPack Collector - 5/4/2006 17
Original Design
![Page 18: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/18.jpg)
SLASHPack Collector - 5/4/2006 18
Problems to Address Overall collection performance
Streamline processing.
Robot file look up Incredibly slow and inefficient. (Not mine!)
Thread interaction Efficient use of threads and queues to process data.
Inefficient code Python code not always the fastest. miniDom XML parsing.
Faster data structures Re-work collection protocols, DNS prefetch. Re-structure URL Frontier, URL Bookkeeper.
![Page 19: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/19.jpg)
SLASHPack Collector - 5/4/2006 19
New Design
![Page 20: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/20.jpg)
SLASHPack Collector - 5/4/2006 20
Performance Modifications Structure Re-design (threading)
More queues, more independence.
Robot Parser String creation, debug calls.
URL Frontier More efficient data structures.
Protocol Modules More efficient data structures. Re-factoring for reliable collection.
XML parsing Switch to faster parser, removal of DOM parser.
DNS Pre-fetching More efficient structuring.
![Page 21: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/21.jpg)
SLASHPack Collector - 5/4/2006 21
New data structures
Dictionary fields for Base data type. (Must be implemented by any data protocol).Now passed in dictionary to storage component.
Key Value Typedatatype user defined datatype name string status HTTP document status stringurl URL of document stringdate collection date stringcrawlname name of current crawl stringsize byte length of content stringmimetype mime type of document stringfingerprint md5sum hash of content stringcontent raw text of document string
![Page 22: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/22.jpg)
SLASHPack Collector - 5/4/2006 22
Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.
![Page 23: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/23.jpg)
SLASHPack Collector - 5/4/2006 23
Performance ComparisonInitial Results:
Weblog data setw/o parsing, robots:
161 doc/s, 50 min.
w/ parsing, robots:
3.9 doc/s, 162 min. (killed)
HTTP Web crawl100 docs w/ parsing, robots:
0.2 doc/s,16 min:13s
150 docs w/ parsing, robots:
0.3 doc/s, 21min:3s
Modified Results:
Weblog data setw/o parsing, robots:
170 doc/s, 42 min.
w/ parsing, robots:
186 doc/s, 63 min.
HTTP Web crawl100 docs w/ parsing, robots:
2.2 doc/s, 1min:10s
150 docs w/ parsing, robots:
2.9 doc/s, 1min:14s
![Page 24: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/24.jpg)
SLASHPack Collector - 5/4/2006 24
Performance Comparison (cont.) Hardware considerations
- HTTP web crawl for 500 documentsPentium 4 2.4GHz 1 GB RAM
3.7 doc/s 3min:18s, 728 docs total
(faster connection)
Pentium 4 2.0GHz, 1GB RAM 3.7 doc/s 4min:25s, 725 docs total
Pentium 4 3.2GHz HT, 2GB RAM 4.3 doc/s 2min:47s, 717 docs total
(faster connection)
![Page 25: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/25.jpg)
SLASHPack Collector - 5/4/2006 25
Performance Comparison (cont.) Comparison to other web crawlers
(published results, 1999) Google: 33.5 doc/sInternet Archive: 46.3 doc/sMercator: 112 doc/s
Consideration of functionalityMore than just a web crawler.Mime types.
![Page 26: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/26.jpg)
SLASHPack Collector - 5/4/2006 26
Available Documentation Pydoc API
Generated with Epydoc.
Use and configuration guide (README).Quick start guide.
Full ReportFull specification of Collector, use,
configuration and development background.
![Page 27: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/27.jpg)
SLASHPack Collector - 5/4/2006 27
Future Work
Addition of pluggable modules.
Improved fingerprint sets.
Improved Python memory management
and threading.
![Page 28: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/28.jpg)
SLASHPack Collector - 5/4/2006 28
References
Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. http://research.compaq.com/SRC/mercator/papers/www/paper.pdf
Soumen Chakrabati, Mining the Web, 2002. Ch. 2, pages 17-43.
Heritrix, Internet Archive. http://crawler.archive.org/
Python Performance Tips http://wiki.python.org/moin/PythonSpeed/PerformanceTips
Prof. Chris Brooks and the SLASHPack Team.
![Page 29: SLASHPack Collector - 5/4/20061 SLASHPack: Collector Performance Improvement and Evaluation Rudd Stevens CS 690 Spring 2006.](https://reader031.fdocuments.us/reader031/viewer/2022013012/56649d3f5503460f94a195a8/html5/thumbnails/29.jpg)
SLASHPack Collector - 5/4/2006 29
Conclusion Four stages:
Addition of protocol module for Weblog data set. Performance testing and identifying problem areas. Modify Collector to improve scalability and
performance. Repeat performance testing and evaluate
performance improvements.
Results: Expanded functionality for data types. Modifications improved performance. More stable and flexible design.