The Biodiversity Heritage Library

KALFATOVIC and PILSK :: SMITHSONIAN INSTITUTION LIBRARIES :: NATIONAL AGRICULTURE LIBRARY :: 30 JANUARY 2008

Martin R. Kalfatovic
Suzanne C. Pilsk
Smithsonian Institution Libraries
30 January 2008

Talk given January 30, 2008 at the National Agricultural Library by Martin Kalfatovic and Suzanne Pilsk.

Transcript of The Biodiversity Heritage Library

Page 2: The Biodiversity Heritage Library

Yet another physical difficulty is the task of assembling the library and indexes which will enable the student to work under proper conditions…. the beginner must now be prepared to spend liberally, or else must establish himself in an institution where a large library exists; if he work by himself with only a few books, he will have to confine himself to a very narrow specialty indeed.

'The Limitations of Taxonomy' by J.M. Aldrich, Science, April 22, 1927, vol. LXV, no. 1686, p.381

Page 3: The Biodiversity Heritage Library

BHL Timeline

2003. Telluride. Encyclopedia of Life meeting
February 2005. London. Library and Laboratory: the Marriage of Research, Data and Taxonomic Literature
May 2005. Washington. Ground work for the Biodiversity Heritage Library
June 2006. Washington. Organizational and Technical meeting
August 2006. New York Botanical Garden. BHL Director’s Meeting
October 2006. St. Louis/San Francisco. Technical meetings
February 2007. Museum of Comparative Zoology. Organizational meeting
May 2007. Washington, DC. Encyclopedia of Life and BHL Portal Launch

Page 4: The Biodiversity Heritage Library

BHL Members

American Museum of Natural History (New York)
Field Museum (Chicago)
Natural History Museum (London)
Smithsonian Institution (Washington)
Missouri Botanical Garden (St. Louis)
New York Botanical Garden (New York)

Page 5: The Biodiversity Heritage Library

BHL Members

Royal Botanic Gardens, Kew
Botany Libraries, Harvard University
Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
Marine Biological Laboratory / Woods Hole Oceanographic Institution

Page 6: The Biodiversity Heritage Library

BHL Members

University of Illinois, Urbana-Champaign (contributing member)

Scheme for addition of European and Asian partners under consideration

Additional categories of membership under consideration

Page 7: The Biodiversity Heritage Library

BHL Focus: Literature

Page 8: The Biodiversity Heritage Library

BHL Focus: Literature

Page 9: The Biodiversity Heritage Library

BHL Focus: Literature

• Core literature pre-1923: 100 million pages (?)
• All pre-1923: 120-150 million pages
• All literature: 280-320 million pages

Page 11: The Biodiversity Heritage Library

Selection Tools

Combined serials list for selecting titles to scan, to avoid duplication of effort

Monographic “de-duping” algorithm

OCLC Collection Analysis
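
The slide names a monographic "de-duping" algorithm but does not describe it. As a rough illustration only, one common approach is to reduce each catalogue record to a normalized match key and group records that share a key. Everything below is an assumption made for the sketch: the key recipe (lowercased ASCII title words plus year), the function names, and the sample records are hypothetical and are not taken from the BHL tooling.

```python
import re
import unicodedata
from collections import defaultdict


def match_key(title: str, year: str) -> str:
    """Build a crude match key: lowercase ASCII title words plus the year."""
    # Strip accents so that accented and unaccented spellings compare equal.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = re.findall(r"[a-z0-9]+", ascii_title.lower())
    # Drop a leading article, which catalogues often file inconsistently.
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    return " ".join(words) + "|" + year.strip()


def group_duplicates(records):
    """Group (library, title, year) records by key; keep keys held by 2+ libraries."""
    groups = defaultdict(list)
    for library, title, year in records:
        groups[match_key(title, year)].append(library)
    return {key: libs for key, libs in groups.items() if len(libs) > 1}


# Hypothetical holdings, invented for illustration.
records = [
    ("Library A", "The Example Flora of Nowhere", "1917"),
    ("Library B", "Example flora of Nowhere.", "1917"),
    ("Library C", "Example Flora of Nowhere", "1923"),
]
duplicates = group_duplicates(records)
print(duplicates)
```

A key this coarse will both miss true duplicates (variant titles) and merge distinct editions issued in the same year, so in practice its output would be a candidate list for human review rather than a final decision.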

Page 12: The Biodiversity Heritage Library

BHL Collections

• 1.3 million catalogue records
• 73% are monographs (remainder are serials at title-level)
• 63% is English language material
• The next most popular language (9%) is German
• About 30% of material was published before 1923

Page 13: The Biodiversity Heritage Library

Selection

Marine Biological Laboratory/WHOI
• Marine monographs
• General Science

Museum of Comparative Zoology
• MCZ publications
• Herpetology monographs and serials
• Ichthyology monographs and serials

Page 14: The Biodiversity Heritage Library

Selection

University of Illinois
• Fieldiana
• Natural history of Illinois

American Museum of Natural History
• AMNH publications
• Ornithology

Natural History Museum
• NHM publications
• Major natural history general serials

Page 15: The Biodiversity Heritage Library

Selection

Botany Collections

Missouri Botanical Garden, New York Botanical Garden, Harvard Botany Libraries, and Royal Botanic Gardens, Kew will cooperatively develop a methodology for botanical publications

Page 16: The Biodiversity Heritage Library

Selection

Smithsonian Institution Libraries
• Smithsonian publications
• Entomology collection
• Marine mammals
• Fishes
• Selected special collections materials

Page 18: The Biodiversity Heritage Library

The Internet Archive

• 501(c)(3) organization
• Dedicated to “Universal Access to Human Knowledge”
• Founder of the Open Content Alliance
• Provides:
– Mass scanning
– Archival storage of files
– Image processing
– Technology development

Page 19: The Biodiversity Heritage Library

Scribe Scanner

• Single Scribe Machine
– Custom built by the Internet Archive
– Human operated
– 3,500 pages per shift per day

Page 20: The Biodiversity Heritage Library

BHL Scanning Centers

Northeast Regional Scanning Center
• 10 Scribe machines
• MBL/WHOI
• Harvard

New York Public Library
• 10 Scribe machines
• AMNH
• NYBG

Page 21: The Biodiversity Heritage Library

BHL Scanning Centers

University of Illinois
• 2 Scribe machines

Natural History Museum, London
• 1 Scribe machine

Missouri Botanical Garden
• Non-Scribe operation

Page 22: The Biodiversity Heritage Library

BHL Scanning Centers

Washington, DC
• 1 Scribe machine at Smithsonian Libraries
• 10 Scribe facility at Library of Congress with Fedlink under construction (Spring 2008)
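
To put the Scribe figures in perspective, the following back-of-the-envelope sketch combines the per-machine rate from the Scribe slide (3,500 pages per shift per day) with the machine counts on the scanning-center slides and the 100 million page estimate for core pre-1923 literature. It assumes one shift per day, every machine running at the nominal rate, and all capacity devoted to BHL material; none of these assumptions is stated in the talk, so the result is only an order-of-magnitude illustration. The function name is hypothetical.

```python
import math

# From the Scribe slide; treating "per shift per day" as one shift per day is an assumption.
PAGES_PER_SCRIBE_PER_DAY = 3_500


def days_to_scan(total_pages: int, scribes: int, rate: int = PAGES_PER_SCRIBE_PER_DAY) -> int:
    """Whole working days needed for `scribes` machines to scan `total_pages`."""
    if scribes <= 0 or rate <= 0:
        raise ValueError("scribes and rate must be positive")
    return math.ceil(total_pages / (scribes * rate))


# Scribe counts as listed on the slides: Northeast Regional (10), New York Public
# Library (10), University of Illinois (2), Natural History Museum (1), Smithsonian (1).
scribes_in_operation = 10 + 10 + 2 + 1 + 1
# Adding the 10 Scribe facility described as under construction at the Library of Congress.
scribes_with_loc = scribes_in_operation + 10

core_pre_1923_pages = 100_000_000  # the slide marks this estimate with "(?)"

days_now = days_to_scan(core_pre_1923_pages, scribes_in_operation)
days_with_loc = days_to_scan(core_pre_1923_pages, scribes_with_loc)
print(days_now, days_with_loc)
```

Under these assumptions the core literature alone represents several years of continuous scanning, which is consistent with the talk's emphasis on coordinated selection to avoid duplicate effort.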

Page 23: The Biodiversity Heritage Library

Scanning Stats

5.5 million plus total pages scanned

500,000 plus from the Natural History Museum, London

1,000,000 from the MBL/WHOI library

Fieldiana, 75,000 plus pages

Page 24: The Biodiversity Heritage Library

Scanning Stats

Smithsonian Libraries
• 250,000 pages (non-Scribe scanned, 1996-2007)
• 100,000 Scribe scanned pages (since August 2007)

Other libraries (non-Scribe)
• MOBOT: 780,000
• AMNH: 150,000

Page 26: The Biodiversity Heritage Library

But what about ...

Page 28: The Biodiversity Heritage Library

But what about

Difficult (impossible?) to re-purpose much of the material
Quality of images often questionable
“Frankenbooks”
Sketchy / inaccurate bibliographic data

Page 29: The Biodiversity Heritage Library

Persistent Identifiers

Stable URL
Handle
DOI
BICI/SICI
ISSN
ISBN
LSIDs

http://www.biodiversitylibrary.org

Page 30: The Biodiversity Heritage Library

Structural Markup

<article>
  <title>A BRIEF CONSIDERATION OF CERTAIN POINTS IN THE MORPHOLOGY OF THE FAMILY CHALCIDID^E.*.</title>
  <author>L. O. HOWARD.</author>
  <volume>1</volume>
  <issue>2</issue>
  <start_page>65</start_page>
  <end_page>86</end_page>
  <start_count_page>85</start_count_page>
  <end_count_page>106</end_count_page>
  <start_page_image_file>3908800908001101smthrich_0085.djvu</start_page_image_file>
  <end_page_image_file>3908800908001101smthrich_0106.djvu</end_page_image_file>
</article>
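
As a minimal sketch of how a machine client might use such a record, the example below parses the article element shown on the slide and lists the page image file for each printed page. It assumes that printed page numbers and scan sequence numbers both run consecutively through the article, and that intermediate image files follow the same `<prefix>_<4-digit sequence>.djvu` pattern as the first and last files; the slide shows only the two endpoints, so the pattern for the pages in between is an inference. The function name `page_images` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The record from the slide, with the title joined onto one line.
ARTICLE_XML = """<article>
  <title>A BRIEF CONSIDERATION OF CERTAIN POINTS IN THE MORPHOLOGY OF THE FAMILY CHALCIDID^E.*.</title>
  <author>L. O. HOWARD.</author>
  <volume>1</volume>
  <issue>2</issue>
  <start_page>65</start_page>
  <end_page>86</end_page>
  <start_count_page>85</start_count_page>
  <end_count_page>106</end_count_page>
  <start_page_image_file>3908800908001101smthrich_0085.djvu</start_page_image_file>
  <end_page_image_file>3908800908001101smthrich_0106.djvu</end_page_image_file>
</article>"""


def page_images(xml_text: str) -> list[tuple[int, str]]:
    """Return (printed page number, image file name) pairs for one article."""
    article = ET.fromstring(xml_text)
    first_printed = int(article.findtext("start_page"))
    first_scan = int(article.findtext("start_count_page"))
    last_scan = int(article.findtext("end_count_page"))
    # Assumption: every file shares the prefix before the final underscore.
    prefix, _ = article.findtext("start_page_image_file").rsplit("_", 1)
    return [
        (first_printed + (seq - first_scan), f"{prefix}_{seq:04d}.djvu")
        for seq in range(first_scan, last_scan + 1)
    ]


pages = page_images(ARTICLE_XML)
print(len(pages), pages[0], pages[-1])
```

The derived last entry can be compared with the record's own `end_page` and `end_page_image_file` values as a consistency check on the assumed naming pattern.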

Page 31: The Biodiversity Heritage Library

Semantic Markup

GoldenGATE

The intention of the GoldenGATE editor is to build a bridge between NLP components and XML markup of natural language text according to arbitrary XML schemas. It allows the deployment of NLP components to marking up the bodies of literature they were designed for. In this way, it enables transforming the texts into XML content according to an XML schema that was designed to gain maximum benefit from the knowledge provided in them.

Integrated Open Taxonomic Access (INOTAXA)

Page 32: The Biodiversity Heritage Library

Taxonomic Intelligence

10.7 million name strings in NameBank

Uses a sophisticated algorithm (TaxonGrab) to locate likely name strings in OCR text

Iterative processing of BHL texts will both increase the number of name strings in NameBank and increase the accuracy of name string recognition
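
To give a feel for what locating likely name strings in OCR text involves, here is a deliberately naive sketch. It is not TaxonGrab: it simply looks for a capitalized word followed by a lowercase word (the shape of a Latin binomial) and keeps the pair only if the first word appears in a list of known genera. The pattern, the function name, and the two-entry genus set are assumptions made for illustration; the set is a hypothetical stand-in for a lookup against a large name repository such as NameBank.

```python
import re

# Shape of a binomial: Capitalized genus, then a lowercase epithet of 3+ letters.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def find_name_strings(ocr_text: str, known_genera: set[str]) -> list[str]:
    """Return binomial-shaped strings whose first word is a known genus."""
    names = []
    for match in CANDIDATE.finditer(ocr_text):
        genus, epithet = match.groups()
        # Filter out ordinary sentence openings such as "The bees".
        if genus in known_genera:
            names.append(f"{genus} {epithet}")
    return names


KNOWN_GENERA = {"Apis", "Bombus"}  # hypothetical stand-in for a name repository
sample = (
    "Specimens of Apis mellifera and Bombus terrestris were taken. "
    "The bees were common."
)
found = find_name_strings(sample, KNOWN_GENERA)
print(found)
```

A heuristic this simple breaks down on OCR errors, abbreviated genera, and names missing from the list, which is why the slide's point about iterative processing matters: each pass over the texts can enlarge the name list and improve later recognition.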

Page 36: The Biodiversity Heritage Library

BHL & Publishers

Page 37: The Biodiversity Heritage Library

Permissions

• Seek permissions from copyright holders

• Opt in Copyright Model: The BHL will actively work with professional societies and associations to integrate their publications into the BHL in a way that serves the societies’ missions and goals

• BHL will digitize learned society backfiles and mount them through the BHL Portal at no cost.

• Will provide a set of files to the publishers for reuse as they see fit.

Page 38: The Biodiversity Heritage Library

BHL Advantages

• Use of the articles will increase as evidenced by citation upsurge
• Long-term management of the digital assets is provided by the BHL at no cost
• Publishers’ content is embedded in the emerging knowledge ecology that is sweeping biology in this century
• Structural markup of backfiles into conformance with NLM DTD (just starting)

Page 39: The Biodiversity Heritage Library

Successes

• Entomological News
• Journal of Hymenoptera Research
• Herpetological Review
• Publications of the San Diego Natural History Museum
• California Academy of Sciences publications
• And more ...

Page 40: The Biodiversity Heritage Library

BHL Portal

• Library catalog-like interface to BHL literature
• Enhanced structural analysis to provide volume/issue/article page access to the literature
• Iterative development based on feedback from user community
• Provide access to two key audiences:
– Humans
– Machines

Page 41: The Biodiversity Heritage Library

Page Delivery

Page 42: The Biodiversity Heritage Library

Taxonomic Intelligence

Page 43: The Biodiversity Heritage Library

Search Browse

Web 2.0 Features

Page 44: The Biodiversity Heritage Library

Discovered Bibliographies

Page 46: The Biodiversity Heritage Library

Demos

BHL Portal
www.biodiversitylibrary.org

uBio
www.ubio.org

Page 48: The Biodiversity Heritage Library

Funding & the Future

Initial grant from the MacArthur and Sloan Foundations (as part of the Encyclopedia of Life grant)

Additional support from parent institutions

Additional grants being actively pursued by BHL and individual members

Page 49: The Biodiversity Heritage Library

Funding & the Future

• Co-evolving bioinformatics resources produce a rich information ecology:
– Consortium for the Barcode of Life (CBOL) with gene sequences deposited in GenBank
– GBIF’s Electronic Catalog of Taxonomic Names
– Herbaria and museum specimen databases

Page 50: The Biodiversity Heritage Library

Funding & the Future

Financial Sustainability Strategy

• Quick ramp-up with high early costs (development, mass scanning, etc.), then drive long-term costs down asymptotically toward zero.
• Derive some long-term costs from the operating budgets of the member institutions (examples under consideration: acquisitions budget, staff positions, etc.).

Page 51: The Biodiversity Heritage Library

Funding & the Future

Financial Sustainability Strategy

• Integrate functions/tasks with wider efforts where appropriate, e.g. mass storage.
• Clear roles for staff who wear multiple hats. Currently 1.5 grant-funded positions, but more than 15 staff make substantive contributions.

Page 52: The Biodiversity Heritage Library

Funding & the Future

The Long Now Strategy

Institutions that are creating the BHL exist to persist through time. That’s an important part of their business. Use them!

The future is uncertain, the technology landscape changes, people pass on. So create consortial structures that are low-overhead, flexible, and can respond quickly.

Page 55: The Biodiversity Heritage Library

Structure of the Encyclopedia of Life

Serine Molecule

Page 56: The Biodiversity Heritage Library

Serine Molecule

Synthesis Center
Field Museum

Biodiversity Heritage Library

Secretariat
Smithsonian

Education & Outreach
Smithsonian/Harvard

Informatics
Marine Biological Laboratory & MOBOT

Page 57: The Biodiversity Heritage Library

EOL Species Pages

Built from a variety of new and existing sources

Views available for varying levels of expertise from novice to expert

Legacy literature a key component of the EOL species pages

Page 61: The Biodiversity Heritage Library

In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned.

Charles Davies Sherborn, Epilogue to Index Animalium, March 1922

Page 62: The Biodiversity Heritage Library

Thank You ... for sticking around!

Page 63: The Biodiversity Heritage Library

CREDITS

Thanks to:

Chris Freeland, Missouri Botanical Garden

Tom Garnett, The Biodiversity Heritage Library Project

The staff at the Internet Archive

Images from

The Galaxy of Images, Smithsonian Libraries (www.sil.si.edu/imagegalaxy)

Martin R. Kalfatovic

Suzanne C. Pilsk

Bernard Scaife