The BHL way to content
![Page 1: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/1.jpg)
The BHL way to content
William Ulate
BHL Technical Director
Global BHL Coordinator
Leiden, Netherlands
February 14, 2013
![Page 2: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/2.jpg)
What is BHL?
The Biodiversity Heritage Library is a consortium of natural history and botanical libraries that
cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to
make that literature available for open access and responsible use as a part of a global “biodiversity
commons.”
![Page 3: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/3.jpg)
Extensive
![Page 4: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/4.jpg)
Global…
![Page 5: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/5.jpg)
New Partners and Geographies
![Page 6: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/6.jpg)
Dear Sir / Madam Can i just congratulate you on an absolutely brilliant online resource. I am compiling a report on an invasive hydromedusae and could not believe the ease and efficiency of this web page which genuinely saved me weeks of my life
Research that previously took months now takes only a few hours
La plus grande #bibliotheque #botanique & #zoologique online The largest online botanical & zoological #library #BHL
The freeing of knowledge may lead to new discoveries and changes in the way the natural world is perceived
![Page 7: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/7.jpg)
More Online Content
Pages (Millions) and Volumes (in Thousands) included in BHL

| Date | Volumes (K) | Pages (M) |
| --- | --- | --- |
| Oct-08 | 22.00 | 9.2 |
| Oct-09 | 40.00 | 16.4 |
| Oct-10 | 84.86 | 31.8 |
| Oct-11 | 94.6 | 35.4 |
| Oct-12 | 105.85 | 38.9 |
![Page 8: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/8.jpg)
Global Replication & Serving
Sites: San Francisco, Woods Hole, London, Alexandria, Beijing
Legend: Replicated Data Center, Portal Application
![Page 9: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/9.jpg)
![Page 10: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/10.jpg)
> 390,000 views in 10 months
> 1200 sets
> 60,000+ images
![Page 11: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/11.jpg)
![Page 12: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/12.jpg)
The Art of Life project: describing and providing access to natural history illustrations from the Biodiversity Heritage Library (BHL)
Title: Stictospiza formosa
Type: Illustrations
Date: Publication: 1898
Agent: Author: Arthur G. Butler (1844-1925); Illustrator: F.W. Frohawk (1861-1946)
Description: A pair of finches with green and yellow bodies resting on reeds
Subjects: Scientific name: Amandava formosa (Latham, 1790); Vernacular name: Green Avadavat or Green Munia; Accepted name: Amandava formosa (Latham, 1790); Birds, finches
Inscriptions: bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
Source: Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition). This image comes from the Biodiversity Heritage Library, and is available online at biodiversitylibrary.org/page/17195895
Rights: Public domain
Element, Definition, Examples, Repeat (Y/N)
• Agents (Repeat: Y)
– Definition: Person or corporate entity involved in the creation, design, production, or publication of a visual resource.
– Example: <vra:agent> <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name> <vra:dates type="life"> <vra:earliestDate>1791</vra:earliestDate> <vra:latestDate>1862</vra:latestDate> </vra:dates> <vra:role vocab="AAT" refid="300025574">publisher</vra:role> </vra:agent>
• Copyright (Repeat: N)
– Definition: The copyright status of the visual resource.
– Example: <vra:rights refid="http://creativecommons.org/licenses/by-nc/2.0/deed.en">Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)</vra:rights>
• Date (Repeat: Y)
– Definition: Date or range of dates associated with the creation or publication of the visual resource.
– Example: <vra:date type="creation"> <vra:earliestDate>1945</vra:earliestDate> <vra:latestDate>1955</vra:latestDate> </vra:date>
• Description (Repeat: Y)
– Definition: A free-text note about the content of the image, including comments, description, or interpretation, that gives additional information not recorded in other categories.
– Example: <vra:description>This illustration shows a scale, coloured illustration of Sepsis annulipes (now known as Encita annulipes) beside the Trifolium ochroleucum plant. Several dissections from Sepsis cylindrica Fab. (all these details are provided on the next page of this book and the subsequent page).</vra:description>
• Inscriptions (Repeat: Y)
– Definition: All marks, captions, or written words added to the object at the time of production or in its subsequent history, including signatures, dates, dedications, texts, and colophons, as well as marks such as the stamps of silversmiths, publishers, or printers.
– Example: <vra:inscription> <vra:position>bottom</vra:position> <vra:text>Radula of L. souleyetianum on a more reduced scale</vra:text> </vra:inscription>
• Source (Repeat: N)
– Definition: A citation for the book, journal or resource that hosts the visual resource.
– Example: <vra:source> <vra:name type="book">Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition).</vra:name> <vra:refid type="URI">http://biodiversitylibrary.org/page/17195895</vra:refid> </vra:source>
• Subject (Repeat: Y)
– Definition: Terms or phrases that describe, identify, or interpret the visual resource.
– Example: <vra:subject> <vra:term type="personalName">Carl Linnaeus</vra:term> </vra:subject>
– Example: <dwc:scientificName>Plant: Picea abies</dwc:scientificName> <dwc:acceptedName>Plant: Picea abies</dwc:acceptedName> <dwc:vernacularName>Plant: Norway spruce</dwc:vernacularName>
• Title (Repeat: Y)
– Definition: The title or identifying phrase given to an image.
– Example: <vra:title xml:lang="la">Sepsis annulipes</vra:title>
– Example: <vra:title type="alternate">Orangutan</vra:title>
• Type (Repeat: Y)
– Definition: Identifies a general category for the visual resource.
– Example: <vra:type>maps</vra:type> <vra:type>forestry maps</vra:type>
Example of illustration described using Art of Life schema
Art of Life schema elements required in Red
We welcome your feedback on the schema! http://tinyurl.com/9hm7nsb
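As a rough illustration of how a record in this style could be assembled programmatically, the sketch below builds a single agent element with Python's standard library. The namespace URI, the helper names `vra` and `build_agent`, and the choice of fields are assumptions made for this example; they are not taken from the Art of Life tooling.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI commonly used for VRA Core 4; adjust as needed.
VRA_NS = "http://www.vraweb.org/vracore4.htm"
ET.register_namespace("vra", VRA_NS)


def vra(tag):
    """Return the namespace-qualified tag name for a VRA element."""
    return f"{{{VRA_NS}}}{tag}"


def build_agent(name, name_type, role, earliest=None, latest=None):
    """Build an agent element (hypothetical helper, not part of any BHL API)."""
    agent = ET.Element(vra("agent"))
    name_el = ET.SubElement(agent, vra("name"), {"type": name_type})
    name_el.text = name
    if earliest and latest:
        # Life dates are optional; only emit them when both are known.
        dates = ET.SubElement(agent, vra("dates"), {"type": "life"})
        ET.SubElement(dates, vra("earliestDate")).text = earliest
        ET.SubElement(dates, vra("latestDate")).text = latest
    ET.SubElement(agent, vra("role")).text = role
    return agent


# Values taken from the example record on this slide.
agent = build_agent("Butler, Arthur G.", "personal", "author", "1844", "1925")
xml_text = ET.tostring(agent, encoding="unicode")
print(xml_text)
```

Because Agents is marked as repeatable, an illustrator entry would simply be a second call to the same helper.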
![Page 13: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/13.jpg)
![Page 14: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/14.jpg)
![Page 15: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/15.jpg)
![Page 16: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/16.jpg)
Where are we?
• Scientific Name Extraction
– Improved algorithm (Thanks uBio!)
• Articles
– Extended BHL data model to store article metadata
– Content and Process to harvest data from BioStor in place
• Create user interfaces for adding article metadata and associated files
– Functional requirements defined
– Process flow for adding article metadata and associated files
– Implement UI changes
• Change BHL UI to accommodate article search
• Change BHL UI to accommodate article display (TOC)
![Page 17: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/17.jpg)
Scientific Name Extraction
• TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
– Improved algorithm, better precision & recall
– More data with TaxonFinder and Neti Neti!
![Page 18: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/18.jpg)
Taxon Names

| BEFORE | | |
| --- | --- | --- |
| Name Instances | 101,591,803 | 101,288,804 |
| Unique Names | 7,498,554 | 7,464,924 |
| Verified Names | 1,905,507 | 1,902,803 |
| EOL Names | 63,130,350 | 62,963,582 |
| EOL Pages | 13,579,868 | 13,532,684 |

| AFTER | | |
| --- | --- | --- |
| Name Instances | 151,222,182 | 150,066,425 |
| Unique Names | 29,246,382 | 29,091,767 |
| Verified Names | 10,153,165 | 10,109,540 |
| EOL Names | 87,791,695 | 87,135,089 |
| EOL Pages | 15,466,713 | 15,342,867 |
![Page 19: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/19.jpg)
Part-level metadata
• Disambiguating and locating structural components in the corpus
• Done by automated and crowdsourced means
– Thanks Rod Page! Welcome others!
• Greatly increases semantic value of the dataset
• Addressing important – makes data addressable and thus linkable
![Page 20: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/20.jpg)
Articles in the BHL UI
![Page 21: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/21.jpg)
Images
![Page 22: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/22.jpg)
PDF Generator
![Page 23: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/23.jpg)
Support citation reconciliation
L. Sp. Pl. 2: 971. 1753
Linneaus, C. Species Plantarum, vol. 2 p. 971. 1753
Linné, Carl von. Sp. Pl. Vol. 2 Page 971. 1753
Caroli Linnaei, Species Plantarum exhibentes plantas rite cognitas, ad genera relatas, cum Differentis Specificis, Nominibus Trivialibus, Synonymis Selectis, Locis Natalibus, secundum SYSTEMA SEXUALE digestas.. 2:971. 1753
Zea mays
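One way to picture the reconciliation problem: the four citation strings on this slide differ in author form and title abbreviation, yet all carry the same volume, page, and year. The sketch below reduces each string to that triple. The regular expressions and the function name `citation_key` are assumptions made for illustration, and real citations would need far more robust parsing than this.

```python
import re

# Volume and page, written as "2: 971", "2:971", "vol. 2 p. 971" or "Vol. 2 Page 971".
VOL_PAGE = re.compile(
    r"(?:vol\.?\s*)?(\d+)\s*(?::|,?\s*(?:p\.?|page)\s*)\s*(\d+)",
    re.IGNORECASE,
)
# A four-digit year between 1500 and 2099.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def citation_key(citation):
    """Return (volume, page, year) as integers, or None if any part is missing."""
    vol_page = VOL_PAGE.search(citation)
    years = YEAR.findall(citation)
    if not vol_page or not years:
        return None
    # The year is taken to be the last year-like number in the string.
    return int(vol_page.group(1)), int(vol_page.group(2)), int(years[-1])


citations = [
    "L. Sp. Pl. 2: 971. 1753",
    "Linneaus, C. Species Plantarum, vol. 2 p. 971. 1753",
    "Linné, Carl von. Sp. Pl. Vol. 2 Page 971. 1753",
    "Caroli Linnaei, Species Plantarum exhibentes plantas rite cognitas, "
    "ad genera relatas, cum Differentis Specificis, Nominibus Trivialibus, "
    "Synonymis Selectis, Locis Natalibus, secundum SYSTEMA SEXUALE digestas.. 2:971. 1753",
]
keys = {citation_key(c) for c in citations}
print(keys)
```

A key like this only groups candidates; matching the title itself (Sp. Pl. versus Species Plantarum) would still be needed before treating two citations as the same work.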
![Page 24: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/24.jpg)
Citations Providers
![Page 25: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/25.jpg)
What we’d like to do
http://biodivlib.wikispaces.com/BHL+and+Gaming
• Improve OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
– http://biodivlib.wikispaces.com/Art+of+Life
– Currently funded by NEH
Challenges framed as games
![Page 26: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/26.jpg)
2007 Name Finding Study
OCR error rate for names only: >35%
35.16%: Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.

Top OCR errors

| Rank | Error | Rank | Error |
| --- | --- | --- | --- |
| 1 | Insert Space | 8 | n->v |
| 2 | Omit Space | 9 | l->i |
| 3 | e->c | 10 | r->i |
| 4 | u->I | 11 | u->ii |
| 5 | u->n | 12 | h->l |
| 6 | i->l | 13 | h->ii |
| 7 | c->e | 14 | e->o |

Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
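A confusion list like the one above can be turned around to propose corrections for OCR'd name strings. The sketch below is a minimal illustration of that idea, not the method used in the study: it assumes each entry reads "true character -> OCR output", uses only a subset of the character pairs, and checks candidates against a tiny made-up lexicon. The names `CONFUSIONS`, `candidates`, and `correct` are hypothetical.

```python
# Assumption: each pair is (true text, what OCR produced), from the table above.
CONFUSIONS = [
    ("e", "c"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "ii"), ("e", "o"),
]


def candidates(token):
    """Return every string reachable by undoing exactly one known confusion."""
    out = set()
    for true, seen in CONFUSIONS:
        start = token.find(seen)
        while start != -1:
            out.add(token[:start] + true + token[start + len(seen):])
            start = token.find(seen, start + 1)
    return out


def correct(token, lexicon):
    """Return the token if known, a unique single-fix match, or None."""
    if token in lexicon:
        return token
    hits = sorted(candidates(token) & lexicon)
    return hits[0] if len(hits) == 1 else None


# Illustrative lexicon only; a real one would be a verified name list.
lexicon = {"Quercus", "Picea"}
fixed = [correct(t, lexicon) for t in ["Qucrcus", "Plcea", "Picea", "Xyz"]]
print(fixed)

# The error rate quoted on the slide follows from the two counts.
error_rate = round(1056 / 3003 * 100, 2)
print(error_rate)
```

Returning None when zero or several lexicon entries match keeps ambiguous tokens out of the automatic path, so they can be routed to human review instead.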
![Page 27: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/27.jpg)
Abbild ungen und Beschreibungen der
Fische Syriens, nebst
einer neuen Classification und Characteristik sämmtlicher Gattungen
der i
JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in
Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.
STUTTGART. E. Schweizerbart' sehe Verlagshandlung,
1843.
![Page 28: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/28.jpg)
Older material
• Great deal of material is pre-1923
• Irregular fonts – blackletter
• Multiple languages on same page – English text with Latin scientific names
• Changes in geographic names
• Changes in scientific names
![Page 29: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/29.jpg)
*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �
', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �
r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
![Page 30: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/30.jpg)
Expanding scope
• Manuscripts, field notebooks – mostly handwritten, often with drawings
• Global expansion means dealing with non-Western script systems and a whole new set of OCR problems
– Arabic materials from Bibliotheca Alexandrina in Egypt
![Page 31: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/31.jpg)
Images
![Page 32: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/32.jpg)
OCR Improvements
• Gaming
• Transcription
![Page 33: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/33.jpg)
OCR Improvements
• Transcription
• Purposeful Gaming
• Crowdsource Markup
![Page 34: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/34.jpg)
Transcribe Bentham
• A collaboration of the University of London Computer Centre, UCL Library Services and UCL Learning and Media Services, with consultation from the UCL Centre for Digital Humanities
• Volunteer users can log in to the Transcription Desk and transcribe previously unstudied and unpublished manuscripts from the Bentham Papers collection in UCL Library's Special Collections.
• Since launch, volunteers from around the world have transcribed several thousand Bentham manuscripts to an extremely high standard.
• Results and findings: http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html
![Page 35: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/35.jpg)
Transcribe Bentham
• Who were the volunteers?
• http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html
![Page 36: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/36.jpg)
Transcribe Bentham
• Age ranges
• http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html
![Page 37: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/37.jpg)
http://blog.winepresspublishing.com/2011/05/pubtoons-23-angry-books/
![Page 38: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/38.jpg)
Purposeful Gaming
Space Climate
Humanities
Nature Biology
![Page 39: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/39.jpg)
Purposeful Gaming
DIGITALKOOT
• Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet, giving easier access to the Finnish cultural heritage.
• Launched on Feb 8, 2011; by Nov 29, 2012, nearly 110,000 participants had completed over 8 million word-fixing tasks.
• DigiTalkoot enabled volunteers to participate in this fixing work by playing games.
![Page 40: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/40.jpg)
![Page 41: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/41.jpg)
![Page 42: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/42.jpg)
OCR Improvements
German text interpreted by the OCR process as: “unb auf ben ©elnrgen be6 fublic{)en”
![Page 43: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/43.jpg)
OCR Improvements
Different resulting texts from parsing the phrase: “und auf den Gebirgen des südlichen Deutschlands”
(“and on the mountains of southern Germany”)
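| # | IA OCR | OCR 2 | Transcription 1 | Transcription 2 | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | unb | und | und | und | Ok |
| 2 | den | ben | den | den | Ok |
| 3 | ©elnrgen | ©ebirgen | Bebirgen | Gebirgen | X |
| 4 | be6 | des | de5 | des | Chk |
| 5 | fublic{)en | fublichen | Füdlichen | Südlichen | X |
| 6 | £)eittfc{)(anb6 | Deutfchlanbs | Deutfchlands | Deutschlands | X |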
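The Ok / Chk / X column is consistent with a simple agreement count across the four readings of each word. The sketch below reproduces the column under that assumption: three or more identical readings give Ok, exactly two give Chk, and no agreement gives X. The rule is inferred from the six rows above rather than stated on the slide, and the function names are hypothetical.

```python
from collections import Counter


def agreement_status(readings):
    """Classify one word by how many of its readings agree exactly."""
    top_count = Counter(readings).most_common(1)[0][1]
    if top_count >= 3:
        return "Ok"   # strong agreement, accept automatically
    if top_count == 2:
        return "Chk"  # partial agreement, flag for a quick check
    return "X"        # no agreement, needs human attention


def consensus(readings):
    """Return the most common reading if at least two sources agree."""
    word, count = Counter(readings).most_common(1)[0]
    return word if count >= 2 else None


# Readings in column order: IA OCR, OCR 2, Transcription 1, Transcription 2.
rows = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
    ["fublic{)en", "fublichen", "Füdlichen", "Südlichen"],
    ["£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"],
]
statuses = [agreement_status(r) for r in rows]
for readings, status in zip(rows, statuses):
    print(status, consensus(readings))
```

Words marked X are the natural candidates for the game or transcription tasks described on the surrounding slides, since no automatic source can settle them.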
![Page 44: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/44.jpg)
Crowdsource Markup
| Display text | Species Profile Model category |
| --- | --- |
| General/summary | TaxonBiology |
| Geographic range | Distribution |
| Habitat | Habitat |
| Food sources and feeding behavior | TrophicStrategy |
| Physical description (general) | Description |
| Physical description (detailed morphology) | DiagnosticDescription |
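In a markup tool, the table above would act as a lookup from the label a volunteer sees to the category stored with the passage. A minimal sketch of that lookup follows; the dictionary and function names, and the shape of the tagged record, are assumptions for illustration only.

```python
# Display label shown to the volunteer -> Species Profile Model category.
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}


def tag_passage(text, display_label):
    """Attach the category for a chosen label to a passage (hypothetical record shape)."""
    category = SPM_CATEGORY.get(display_label.strip())
    if category is None:
        raise ValueError(f"Unknown display label: {display_label!r}")
    return {"text": text, "category": category}


tagged = tag_passage("Found on reeds near standing water.", "Habitat")
print(tagged)
```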
![Page 45: The BHL way to content](https://reader034.fdocuments.us/reader034/viewer/2022051322/54629714af7959236a8b4897/html5/thumbnails/45.jpg)
Thank you
William Ulate
Global BHL Project Manager / Technical Director
Missouri Botanical [email protected]: william_ulate_r