© 2008 Biodiversity Heritage Library www.biodiversitylibrary.org
Biodiversity Heritage Library (BHL): Technology Overview
Chris Freeland
Director, Bioinformatics
Missouri Botanical Garden
Technical Director
Biodiversity Heritage Library
www.biodiversitylibrary.org
BHL Partners
Museums
– American Museum of Natural History (New York)
– Natural History Museum (London)
– Smithsonian Institution (Washington)
– The Field Museum (Chicago)
Botanical Gardens
– Missouri Botanical Garden
– New York Botanical Garden
– Royal Botanic Garden, Kew
University Libraries
– Botany Libraries, Harvard University
– Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
– University of Illinois
Bioinformatics Institutes
– MBL/WHOI
– uBio.org
Why have BHL?
In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned. One never knows wherein one edition differs from or supplements the other and unless these are on the same table at the same time it is not possible to collate them properly. Moreover for accurate work it is necessary for the student to verify every reference he may find; it is not enough to copy from a previous author; he must verify each reference itself from the original.
Charles Davies Sherborn, Epilogue to Index Animalium, March 1922
Charles Davies Sherborn (1861-1942)
Unique Components of BHL
• Combining metadata records from multiple libraries (similar, but different) and presenting them through a shared portal
• Use of JPEG2000
• Web 2.0 Mashups
• Taxonomic data mining
• Services
• Rare & novel content
Scanning process
1. Select Book
2. Pull from Shelf
3. Send to IA scanning center
4. Book is scanned & QA
5. Page images loaded on IA cluster
– Derivatives created
6. Book returned to library
7. Files harvested from IA portal
8. Books available for display within BHL portal
Mushrooms of America, edible and poisonous. Ed. by Julius A. Palmer, Jr. , 1885.
Scan & Store: Internet Archive
Scanning on Scribes
Storage in Petaboxes
Scanning & Derivatives
Master
• XML
• JP2
Derivatives
• PDF
• JPG
• TXT
• DjVu
Harvest from IA
Extract, Transform, Load (ETL)
• Custom scripts to extract content via IA’s APIs
• Database scripts to transform to relational data structure
• Load into database
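The three ETL steps above can be sketched roughly as follows. This is a minimal illustration and not BHL's actual scripts: the extract step is replaced by an in-memory sample record, and the record shape, table names, and the second page file name are all assumptions made for the example.

```python
import sqlite3

# Hypothetical result of an extract step for one scanned book.
# The identifier and first file name appear elsewhere in this deck;
# the record layout and the second file name are invented.
SAMPLE_ITEM = {
    "identifier": "mushroomsofameri00palm",
    "title": "Mushrooms of America, edible and poisonous",
    "year": "1885",
    "pages": [
        "mushroomsofameri00palm_0010.jp2",
        "mushroomsofameri00palm_0011.jp2",
    ],
}


def transform(item):
    """Flatten one nested item record into relational rows."""
    book_row = (item["identifier"], item["title"], int(item["year"]))
    page_rows = [
        (item["identifier"], sequence, file_name)
        for sequence, file_name in enumerate(item["pages"], start=1)
    ]
    return book_row, page_rows


def load(conn, book_row, page_rows):
    """Create the (assumed) tables if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book "
        "(identifier TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page "
        "(book_id TEXT, sequence INTEGER, file_name TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO book VALUES (?, ?, ?)", book_row)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", page_rows)
    conn.commit()


def run_etl(conn, item):
    """Transform one extracted item and load it into the database."""
    book_row, page_rows = transform(item)
    load(conn, book_row, page_rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(connection, SAMPLE_ITEM)
    print(connection.execute("SELECT COUNT(*) FROM page").fetchone()[0])
```

Keeping transform free of any database or network calls makes it easy to test against saved extract output.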
Stable URL
Attribution
Name Finding
Page Turning
Zoom/Pan
Download/View
Browse
Search
Filter
Target/Object
JPEG2000 (*.jp2) display
• RAW original => 85% .jp2
• LuraTech encoder
– Wavelet compression
• LizardTech decoder
– Tiled on the fly, cached for performance
• GSIV browser-based client viewer
– ‘AJAXian’
BHL/IA architecture
A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
[Diagram labels: Browser GSIV.js; www.biodiversitylibrary.org; /page/1274907; BHLdb; locate: pageid: 1274907; IA; http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2; images.mobot.org; LizardTech ExpressServer; .jp2; .jpg]
Transfer: 5.0+ sec
Time to deliver image: 8+ sec
Reuse, don’t rebuild
1. TIF image from scanner
2. Converted to text via PrimeOCR
3. Name finding via TaxonFinder
4. Extract names
5. Submit to NameBank
6. SOAP response
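To make the name-finding step concrete, here is a toy sketch of the general idea: pull candidate binomials out of OCR text, then check them against an authority list. It is not TaxonFinder's algorithm and does not call NameBank. The regular expression is a deliberately naive assumption, and KNOWN_NAMES is a hypothetical stand-in for the authority lookup.

```python
import re

# Naive pattern: a capitalized word followed by a lowercase word.
# Real name finders use dictionaries and context, not just this shape.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stand-in for a lookup against a names authority.
KNOWN_NAMES = {"Carcharodon carcharias"}


def find_candidate_names(ocr_text):
    """Return unique binomial-shaped strings in order of first appearance."""
    candidates = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if name not in candidates:
            candidates.append(name)
    return candidates


def verify(candidates, known=KNOWN_NAMES):
    """Keep only candidates that the authority list recognizes."""
    return [name for name in candidates if name in known]


if __name__ == "__main__":
    text = "The white shark, Carcharodon carcharias, is large."
    print(find_candidate_names(text))
    print(verify(find_candidate_names(text)))
```

The pattern alone also picks up ordinary phrases such as "The white", which is why a verification step against an authority matters.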
Name Finding in action
with Taxonomic Intelligence…
Names data mining
Tag cloud from LCSH
Subject Heading from library catalog
Expressed as MARCXML
Tag Cloud
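A tag cloud of this kind needs a count per subject heading. The sketch below shows one way to get those counts from MARCXML, assuming topical headings sit in field 650, subfield a, as in standard MARC 21. The sample records are invented for illustration and this is not BHL's code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard MARCXML namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample records, for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Mushrooms</subfield>
      <subfield code="z">United States</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Mushrooms</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Botany.</subfield>
    </datafield>
  </record>
</collection>"""


def subject_counts(marcxml):
    """Count topical subject headings (650 $a) across all records."""
    root = ET.fromstring(marcxml)
    counts = Counter()
    for field in root.iterfind(".//marc:datafield[@tag='650']", MARC_NS):
        for sub in field.iterfind("marc:subfield[@code='a']", MARC_NS):
            if sub.text:
                # Drop trailing punctuation so "Botany." and "Botany" merge.
                counts[sub.text.strip().rstrip(".")] += 1
    return counts


if __name__ == "__main__":
    for heading, count in subject_counts(SAMPLE).most_common():
        print(heading, count)
```

The counts can then be mapped to font sizes in the display layer.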
Geocoding LCSH
RSS Feeds
Specific: Last 25 books published in German from NYBG
RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25/GER/NYBG
1. Allgemeine deutsche Garten-Zeitung, 7, 1829 (added: 04/03/2008)
2. Zeitschrift für wissenschaftliche Mikroskopie und für mikroskopische Technik. 2, 1885 (added: 03/28/2008)
3. Zeitschrift für technische Biologie. 7, 1919 (added: 03/27/2008)
4. …
General: Last 25 books from all libraries
RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25
1. Summa plantarum : v.1 (added: 05/01/2008)
2. Vegetable materia medica of the United States (added: 04/30/2008)
3. The family herbal; (added: 04/30/2008)
4. …
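The two feed locations suggest a path pattern of count, then language code, then contributing institution. A small helper can build such URLs, as sketched below. Only the two forms shown on the slide are attested. The count-plus-language form is an assumption, and the function name is hypothetical.

```python
BASE_URL = "http://www.biodiversitylibrary.org/RecentRss"


def recent_rss_url(count, language=None, institution=None):
    """Build a RecentRss feed URL from the path segments seen on the slide."""
    parts = [BASE_URL, str(count)]
    if language is None and institution is not None:
        # The slide never shows an institution without a language segment.
        raise ValueError("institution requires a language code")
    if language is not None:
        parts.append(language)
    if institution is not None:
        parts.append(institution)
    return "/".join(parts)


if __name__ == "__main__":
    print(recent_rss_url(25))
    print(recent_rss_url(25, "GER", "NYBG"))
```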
Services
• Names
– v.1 released
http://www.biodiversitylibrary.org/services/name/NameService.asmx
• Stable URLs
– http://www.biodiversitylibrary.org/bibliography/1652
– http://www.biodiversitylibrary.org/name/Carcharodon_carcharias
• Future:
– Citation Resolver
– Titles Resolver
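The stable URL examples follow a simple shape, with a numeric identifier for bibliographies and pages and an underscore-joined scientific name for names. A sketch of helpers that produce those forms follows. The function names are hypothetical, and the underscore rule is inferred from the single name example on the slide.

```python
BASE_URL = "http://www.biodiversitylibrary.org"


def bibliography_url(title_id):
    """Stable URL for a bibliography record, given its numeric id."""
    return f"{BASE_URL}/bibliography/{int(title_id)}"


def page_url(page_id):
    """Stable URL for a single page, given its numeric id."""
    return f"{BASE_URL}/page/{int(page_id)}"


def name_url(scientific_name):
    """Stable URL for a name; whitespace becomes underscores (inferred rule)."""
    slug = "_".join(scientific_name.split())
    return f"{BASE_URL}/name/{slug}"


if __name__ == "__main__":
    print(bibliography_url(1652))
    print(page_url(1274907))
    print(name_url("Carcharodon carcharias"))
```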
BHL Name Services
http://www.biodiversitylibrary.org/services/name/NameService.asmx
Provider Integration
• Encyclopedia of Life
• Atrium Andes Biodiversity
• Wikipedia
• EDIT Scratchpads
• More to come…
Hardware Infrastructure
• Distributed
• Partially redundant
– Work needed
• Mixed platforms
• Mixed app frameworks
MOBOT
Petabox cluster
Internet Archive
File Storage Estimates
• 4MB per page including derivatives
• 1 million pages = 4TB storage
• Expected output: 60–100 million pages
– 240–400 TB for files
– 10–20 GB for db
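The file estimate follows directly from the per-page figure, as the short check below shows. It assumes decimal units (1 TB = 1,000,000 MB), which is what makes 1 million pages come out to exactly 4 TB. The function name is hypothetical.

```python
MB_PER_PAGE = 4          # from the slide, including derivatives
MB_PER_TB = 1_000_000    # decimal units, an assumption


def storage_tb(pages):
    """Estimated file storage in TB for a given page count."""
    return pages * MB_PER_PAGE / MB_PER_TB


if __name__ == "__main__":
    for pages in (1_000_000, 60_000_000, 100_000_000):
        print(f"{pages:,} pages -> {storage_tb(pages):g} TB")
```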
Future Work
• Services
– Citation Resolver
– Titles Resolver
• Interfaces
• Editing
– Authoritative
– Community
• Backend
Fedora
• Funded by Gordon and Betty Moore Foundation to adopt Fedora Commons
• Working with Internet Archive to define use and practice
• Project completion: December 2009
Thank You
Chris Freeland
BHL Portal
www.biodiversitylibrary.org
BHL Blog
biodiversitylibrary.blogspot.com
BHL collection at Internet Archive
www.archive.org/details/biodiversity