Bixo - a webcrawler toolkit
Ken Krugler, Stefan Groschupf
Tambako the [email protected]
Friday, May 22, 2009
Agenda
Overview
Background
Motivation
Goals
Status
Differences
Architecture
Data life cycle
Robust Testing
Resources
Overview
Primary users will be companies extracting data from the web (not search)
Interested in subset of the web
Typically part of larger data processing system
Motivation - tech
No good solution available
We need a toolkit
Missing from Nutch et al.
Easy to integrate
Easy to extend
Easy to understand
API vs CLI
Pluggable I/O
Avoid common problems
Spider traps & link farms
Slow servers
Hanging crawls
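The slide lists spider traps among the problems to avoid but does not say how they are detected. One common heuristic is to reject URLs that are very long, very deep, or that repeat the same path segment many times. The sketch below illustrates that idea only. The thresholds and the function name are assumptions made for this example and are not taken from Bixo.

```python
from collections import Counter
from urllib.parse import urlsplit

# All three limits are assumed values for illustration.
MAX_URL_LENGTH = 256
MAX_PATH_DEPTH = 8
MAX_SEGMENT_REPEATS = 2


def looks_like_spider_trap(url: str) -> bool:
    """Return True if the URL matches simple trap heuristics."""
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return True
    # A path such as /a/b/a/b/a/b often comes from a looping relative link.
    repeats = Counter(segments)
    return any(count > MAX_SEGMENT_REPEATS for count in repeats.values())
```

Slow servers and hanging crawls are usually handled separately, with per-request timeouts and a time budget per host.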
Motivation - EMI
Screen scrape, data extraction
Artist websites, e.g. concert dates
Many pages from large sites
Just crawl, no index
One of many inputs into Business Intelligence
Integration in larger BI system (Cascading-based)
Motivation - Share This
Focused index for key partners
Data analysis and mining of 100m pages
Integration into existing log analysis and data mining systems (Cascading-based)
Low IT/Ops support requirements
Goals
Fulfill key motivating requirements
OSS project with business-friendly license
Focus on vertical crawling, leverage other projects
Efficient execution in EC2/cloud environment
Grow OSS community
Current Status
We already do crawls in EC2
2 sponsored developers, since March 2009
MIT license
Todo:
Improve robots.txt handling
Bugfixes and many improvements
Website & documentation
A CLI for easy testing.
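For readers unfamiliar with what robots.txt handling involves, here is a minimal sketch using Python's standard urllib.robotparser. It is not Bixo code. The robots.txt content, the user agent string, and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body (invented for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
allowed = checker.can_fetch("example-crawler", "http://example.com/tour/dates")
blocked = checker.can_fetch("example-crawler", "http://example.com/private/x")
```

A crawler would normally fetch and cache one robots.txt per host and consult it before every request to that host.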
Differences (from Nutch)
Toolkit versus system - building blocks, not plugins
Workflow focus, versus system where you set conf and run a command
More emphasis on instrumentation - monitoring, error handling
No search serving
Vertical crawl, not intranet or whole web
HTTP(S) only, not ftp, etc.
Differences (from Hadoop)
Not much, which is a good thing
Generates lots of data - want to store in S3, want to minimize writes
Heavy user of DNS server - extra set up for caching server
Fetch phase is unusual Cascading topology
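A crawl resolves the same host names over and over, which is why the slide calls for a caching DNS server. The sketch below shows the same idea in-process with a memoizing wrapper. It stands in for, and is not the same as, the caching server the slide refers to. The function name and cache size are assumptions.

```python
import socket
from functools import lru_cache
from typing import Callable


def make_cached_resolver(
    resolve: Callable[[str], str] = socket.gethostbyname,
    maxsize: int = 10_000,  # assumed cache size
) -> Callable[[str], str]:
    """Wrap a resolver so each host name is looked up at most once."""

    @lru_cache(maxsize=maxsize)
    def cached(host: str) -> str:
        return resolve(host)

    return cached
```

Passing the resolver in as a parameter keeps the wrapper testable without network access. Note that this simple cache ignores DNS TTLs, which a real caching server honors.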
Hadoop Intro
Open Source map reduce system
Execution layer - map reduce
Mapper, Reducer Tasks
Storage layer - (distributed) file system
Local FS, HDFS, S3, etc
Scales from single node to thousands
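To make the mapper and reducer roles concrete, here is a single-process sketch of the map reduce pattern that counts URLs per host. It illustrates the programming model only and does not use Hadoop. All function names are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_url(url):
    """Mapper: emit one (host, 1) pair per input URL."""
    yield urlsplit(url).hostname or "", 1


def reduce_counts(host, values):
    """Reducer: sum all values emitted for one host."""
    return host, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))
```

In Hadoop the grouping step is the distributed shuffle and sort between the map and reduce tasks. Here it is an in-memory dictionary.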
Cascading Intro
Data processing can be hard with Hadoop
Cascading extends Hadoop
Provides simple data processing API
Reusable (unix) pipe based concept
Sources and Sinks separated
HDFS, HBase, JDBC, Aster etc.
Assemble Pipes, Source and Sink in a Flow
GPL or OEM, though might change
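The idea of assembling pipes between a source and a sink can be sketched with plain Python generators. This is an analogy for the concept and not Cascading's API. The names each, keep and run_flow are invented for the example.

```python
from typing import Callable, Iterable, Iterator, List

Pipe = Callable[[Iterable[dict]], Iterator[dict]]


def each(function: Callable[[dict], dict]) -> Pipe:
    """Build a pipe that applies a function to every tuple."""
    def pipe(stream):
        for item in stream:
            yield function(item)
    return pipe


def keep(predicate: Callable[[dict], bool]) -> Pipe:
    """Build a pipe that drops tuples failing the predicate."""
    def pipe(stream):
        for item in stream:
            if predicate(item):
                yield item
    return pipe


def run_flow(source: Iterable[dict], pipes: List[Pipe], sink: list) -> list:
    """Connect source, pipes and sink, then drain the stream into the sink."""
    stream = iter(source)
    for pipe in pipes:
        stream = pipe(stream)
    sink.extend(stream)
    return sink


source = [{"url": "http://example.com/"}, {"url": "ftp://example.com/file"}]
pipes = [
    keep(lambda t: t["url"].startswith("http")),
    each(lambda t: {**t, "length": len(t["url"])}),
]
result = run_flow(source, pipes, [])
```

Because the source and sink are separate from the pipes, the same pipe list can be reused with a different input or output.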
Architecture
Hadoop
Cascading
Bixo pipes
Your Java, your Groovy, your Jython
Input, output
Single JVM, server, cluster
Data life cycle
Inject URLs in URL DB
Select URLs from URL DB - based on recrawl policy, or partner/domain, or type, etc
Normalize URLs
Score URLs
Group URLs
Fetch
Save content
and/or update URL DB
and/or analyze/parse content
Notice nothing about indexing, pushing out index, serving up index.
Meta data fully supported
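The life cycle can be sketched as one pass over an in-memory URL DB. This is a toy model under stated assumptions and not Bixo's implementation. The URL DB is a dict, the recrawl policy is a fixed interval, the score favors shallow paths, and the fetcher is passed in as a function. All names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase the host and drop the fragment."""
    parts = urlsplit(url.strip())
    return parts._replace(netloc=parts.netloc.lower(), fragment="").geturl()


def inject(url_db, urls):
    """Add new URLs to the URL DB without touching known ones."""
    for url in urls:
        url_db.setdefault(normalize(url), {"last_fetched": None, "status": None})


def select(url_db, now, recrawl_after):
    """Pick URLs never fetched or due again under the recrawl policy."""
    return [
        url for url, meta in url_db.items()
        if meta["last_fetched"] is None or now - meta["last_fetched"] >= recrawl_after
    ]


def score(url):
    """Assumed scoring rule: shallower paths score higher."""
    return 1.0 / (1 + urlsplit(url).path.count("/"))


def crawl_once(url_db, fetch, now, recrawl_after=86_400):
    """Select, group by host, fetch in score order, save content, update the DB."""
    groups = defaultdict(list)
    for url in select(url_db, now, recrawl_after):
        groups[urlsplit(url).hostname].append(url)
    content = {}
    for urls in groups.values():
        for url in sorted(urls, key=score, reverse=True):
            status, body = fetch(url)
            url_db[url].update(last_fetched=now, status=status)
            if status == 200:
                content[url] = body
    return content
```

As on the slide, nothing here builds or serves an index. The loop only moves URLs and content through the URL DB.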
Architecture - Pipes
fetch pipe, parse pipe, update url db pipe, url pipe
Import Url Pipe
Import SubAssembly
Each
URL Normalizing
IUrlFilter
Source
URL DB
Sink
URLs
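The import step applies URL normalizing and a URL filter to each incoming URL. The sketch below shows one plausible version of both. The normalization rules (lowercase host, drop default port and fragment, use "/" for an empty path) and the HTTP(S)-only filter are assumptions for illustration. They are not the behavior of Bixo's IUrlFilter.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form so equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def accept_url(url):
    """Filter: keep only HTTP and HTTPS URLs."""
    return urlsplit(url).scheme in ("http", "https")


def import_urls(raw_urls):
    """Normalize every URL, then drop those the filter rejects."""
    normalized = (normalize_url(raw) for raw in raw_urls)
    return [url for url in normalized if accept_url(url)]
```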
Fetch Pipe
Fetch SubAssembly
Each
URL Domain Map
Each
URL Scoring
GroupBy
URL Grouping
Every
Fetching
GroupingKeyGenerator, IHttpFetcher, ScoreGenerator
URLs
Source
Pages & Status
Sink
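The fetch pipe has three plug points: a grouping key generator, a score generator, and a fetcher. The sketch below models each as a plain function argument to show how they interact. It does not reproduce the Bixo interfaces named on the slide. The signatures and the dict-based tuples are assumptions.

```python
from itertools import groupby
from urllib.parse import urlsplit


def fetch_pipe(urls, grouping_key, score, fetch):
    """Score each URL, group by key, then fetch each group best-first."""
    tuples = [
        {"url": url, "group": grouping_key(url), "score": score(url)}
        for url in urls
    ]
    # groupby needs sorted input; within a group, highest score comes first.
    tuples.sort(key=lambda t: (t["group"], -t["score"]))
    results = []
    for _group, members in groupby(tuples, key=lambda t: t["group"]):
        for member in members:
            status, content = fetch(member["url"])
            results.append(
                {"url": member["url"], "status": status, "content": content}
            )
    return results


def host_key(url):
    """Example grouping key: the host name."""
    return urlsplit(url).hostname or ""


def depth_score(url):
    """Example score: shallower paths rank higher."""
    return 1.0 / (1 + urlsplit(url).path.count("/"))
```

Grouping by host lets a real fetcher apply per-host politeness, for example one connection and a delay per host, while different hosts are fetched in parallel.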
Parse Pipe
Parse SubAssembly
Each
URL Domain Map
IParser
Pages
Source
ParsedText & OutLinks
Sink
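The parse step turns fetched pages into parsed text and outlinks. A minimal sketch with Python's standard html.parser is shown below. It is a stand-in for whatever parser is plugged in behind IParser, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextAndLinks(HTMLParser):
    """Collect visible text fragments and absolute outlink URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.outlinks.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(url, html):
    """Return (parsed_text, outlinks) for one fetched page."""
    parser = TextAndLinks(url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.outlinks
```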
Update Pipe
Update DB SubAssembly
Each
URL Normalizing
GroupBy
URL Grouping
Every
URL Selection
IUrlFilter
URLs
Source
URL DB
Sink
LastUpdated
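One reading of the update step is: merge existing URL DB records with newly seen URLs, group them by URL, and keep one record per URL based on LastUpdated. The sketch below follows that reading. The record layout, the status values, and the "newest wins" selection rule are assumptions rather than documented Bixo behavior.

```python
from itertools import groupby
from operator import itemgetter


def update_url_db(existing, discovered):
    """Merge records and keep the most recently updated one per URL."""
    records = sorted(existing + discovered, key=itemgetter("url"))
    merged = []
    for _url, group in groupby(records, key=itemgetter("url")):
        merged.append(max(group, key=itemgetter("last_updated")))
    return merged
```

For example, a URL that was known but unfetched and has just been fetched appears twice in the merged input, and only the fetched record survives.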
Output
MultiSinkTap
Sink
Each
URL Status
Each
URL Content
IndexScheme
Sink
Each
Lucene Index
Robust testing
Unit tests
Jetty with special request handlers
wrong content type
slow responses
wrong headers
WebGraph test platform
test/simulate URL discovery
Looping/URL DB updates
page rank calcs, etc.
Wikipedia
large amount of data that can be "crawled" via local setup
http://webgraph.dsi.unimi.it/
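The slide describes unit tests that run against a local Jetty server with special request handlers. The same technique can be sketched with Python's standard http.server in place of Jetty: a handler that misbehaves on purpose, so fetcher code can be tested without touching the real web. The paths /slow and /wrong-type, the delay, and all names are assumptions for this example.

```python
import threading
import time
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

SLOW_DELAY_SECONDS = 0.2  # assumed delay for the slow endpoint


class MisbehavingHandler(BaseHTTPRequestHandler):
    """Serve HTML, but delay /slow and mislabel /wrong-type."""

    def do_GET(self):
        if self.path == "/slow":
            time.sleep(SLOW_DELAY_SECONDS)
        content_type = "image/png" if self.path == "/wrong-type" else "text/html"
        body = b"<html><body>ok</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_test_server():
    """Start the server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), MisbehavingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_headers(port, path, timeout=5.0):
    """Fetch a path from the test server; return (status, content type)."""
    conn = HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status, response.getheader("Content-Type")
    finally:
        conn.close()
```

A test would start the server, point the fetcher under test at server.server_address, and assert on how it reacts to the delayed or mislabeled response.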
Resources
Web: http://bixo.101tec.com/
List: http://groups.yahoo.com/group/bixo-dev
Sources: https://github.com/emi/bixo/tree
Bugtracking: http://oss.101tec.com/jira/browse/bixo
Scale Unlimited, Inc.
Ken Krugler, Stefan Groschupf
hans [email protected]