Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

28
Archive-It: Scaling Beyond a Billion Archival Web-pages Aaron Binns, Internet Archive [email protected], 2011-10-19

description

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011 Description of Archive-It, the Internet Archive's subscription, self-serve web archiving service, focused on the full-text search system. With nearly 200 partners and over 2000 collections the custom Lucene-based system handles 3+ million index updates per day across an index that totals over 1.3 billion documents. This session will give a detailed description of the architecture and implementation of the Archive-It search system; highlighting many of the challenges due to the scale as well as complex use cases.

Transcript of Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Page 1: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Archive-It: Scaling Beyond a Billion Archival Web-pages

Aaron Binns, Internet Archive [email protected], 2011-10-19

Page 2: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

My Background §  Aaron Binns ([email protected]) §  Internet Archive §  Senior Software Engineer §  Full-text search & cool stuff

•  Full-text search •  Hadoop •  “Big Data”

•  http://github.com/aaronbinns

2

Page 3: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Internet Archive §  Universal access to all knowledge §  http://archive.org §  Founded 1996 §  501(c)(3) non-profit org. §  Digital Library §  San Francisco, CA, USA §  7+ PB of publicly accessible digital materials –  Web archive –  Books, music, video, etc.

3

Page 4: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

4

§  http://web.archive.org §  165,000,000,000+ archived web pages –  HTML –  Images –  CSS –  JavaScript –  Multimedia

§  1996-today

Page 5: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

5

Page 6: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

http://archive-it.org §  Subscription web archiving service –  Select websites to harvest, frequency, depth –  Crawling/Harvesting –  Wayback –  Full-text search

§  Customers –  Public, State & University Libraries –  Local governments –  Museums –  Non-Governmental Organizations (NGOs)

6

Page 7: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Collections & Documents §  Collection –  Web harvest configuration •  URLs to crawl •  Frequency & depth

–  Set of documents archived •  Access via Wayback Machine •  Full-text search

§  Document –  Unique version of a URL –  “Text” documents: HTML, PDF, Office, etc.

7

Page 8: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

8

Archive-It: Collection

Page 9: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

9

Archive-It: Wayback

Page 10: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

10

Archive-It: Replay July 27, 2002

Sept 15, 2011

Page 11: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

11

Archive-It: Search

Page 12: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

12

Archive-It: Search

Page 13: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Challenges and....Solutions?

§  Scale §  Archival web search != web search §  Document formats –  HTML (1996....2011) –  PDF, Office, text, etc.

§  English, Français, Español,漢字, … §  Diversity §  Time

13

Page 14: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Scale

§  200+ customers §  2,272 collections –  Largest: 33,470,659 documents –  24 collections, 10,000,000+ docs –  250 collections, 1,000,000+ docs

§  Total: – 1,375,473,187 unique documents

14

Page 15: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Scale...each day

§  30-40 simultaneous crawls/harvests §  ~150GB of data: HTML, images, media §  ~1.3 million new unique documents –  New URLs never seen before –  New versions of URLs

§  ~1.3 million updates –  Documents unchanged –  New crawl dates

15

Page 16: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Architecture §  Offline indexing –  10 dedicated indexing machines –  ~10% of collections per machine –  Add new documents –  Update existing documents with new dates –  1CPU x 2core, 4GB RAM, 3x2TB disk

§  Search service –  11 machines: 1 master, 10 slaves –  ~10% of collections per slave –  1 collection → 1 Lucene index –  1CPU x 2core, 8GB RAM, 3x2TB disk

16

Page 17: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Diversity

17

Page 18: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Diversity

18

Page 19: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Diversity

19

Page 20: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Field Collapsing / Grouping

20

§  Applied to web documents “Give me the best 1-2 hits from a site”

§  Lucene –  Grouping contrib package

§  Solr –  Field Collapsing

§  What is the performance cost? §  Custom solution

Page 21: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Time

21

§  User experience & understanding –  Archival web search != web search

§  Information Architecture –  Publication date for web pages – difficult

§  Temporal diversity –  Multiple hits per site –  Multiple versions per URL

Page 22: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

22

Time

Page 23: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Searching across collections

23

§  Search all collections of a user §  Search arbitrary group of collections §  1 collection → 1 Lucene index – Search 100 collections.... – Search 100 indexes

§  Collections distributed over 10 searchers

Page 24: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Custom Solutions

24

§  Java §  Built on Lucene §  Investigating Solr –  Capabilities –  Cost

§  Internet Archive –  Open Source –  Apache License –  http://github.com/aaronbinns

Page 25: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Custom Solutions: Indexing

25

§  http://github.com/aaronbinns/jbs §  Archive-It & other archival web collections §  Hadoop-based, or stand-alone §  Java code with Lucene –  Hard-coded “schema” for web documents –  Title, body, keywords, date, mime-type, etc. –  Link analysis & curation to augment scoring

Page 26: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Custom Solutions: Searching

26

§  http://github.com/aaronbinns/tnh §  Custom Java web application with Lucene §  Federated search –  1 master, 10 slaves –  OpenSearch

§  Multiple collections & arbitrary grouping §  CollapsingCollector

Page 27: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

CollapsingCollector

27

§  http://github.com/aaronbinns/tnh §  Extends Lucene Collector §  Field cache: “site” §  Retains top N hits per “site” –  Control N via URL parameter

Page 28: Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

Web Archives!

28

§  Archive-It –  http://archive-it.org/

§  US National Archives –  http://webharvest.gov/

§  UK Web Archive –  http://www.webarchive.org.uk/

– Solr-based §  Web Archive of Catalonia / PADICAT –  Biblioteca de Catalunya –  http://www.padicat.cat/