Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens


Transcript

Page 1: Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens

Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin

Page 2: Goals and Scope

NSF ADBC (#1115116)
~2.3 million specimens
90% of all specimens
900,000 lichens
1.4 million bryophytes
> 60 non-governmental US herbaria (95%)
Mexico, US, Canada
16 digitization centers

Page 3: Digitization Workflow

Page 4: National Portals

Lichen Consortium: http://lichenportal.org, 34 collections, 902,664 records
Bryophyte Consortium: http://bryophyteportal/, 26 collections, 1,300,135 records
Symbiota software

Page 5: Imaging Stage

Capture Image: barcode in file name
Create Skeleton File: species name, country, state, exsiccati, etc.
Upload to FTP server
Image processing: extract barcode, create web versions, map to portal DBs
Herbarium Database
Automated OCR: Tesseract, ABBYY
Existing Record: simply link image
Upload to FTP server
Image URLs
Manage Specimen Data in Portal
Manage / Review Records in Portal
Symbiota Editor: review, edit, keystroke
Create New Record: barcode, image, skeletal data
Automated NLP: Darwin Core Parsing
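The "barcode in file name" and "extract barcode" steps could look roughly like the sketch below. The slides do not give the naming scheme, so the pattern here (an institution code followed by digits, as in `ASU0012345.jpg`) and the function name `extract_barcode` are assumptions for illustration only.

```python
import re
from pathlib import PurePath
from typing import Optional

# Hypothetical barcode pattern: 2-6 capital letters followed by 5-10 digits.
# The real naming scheme is not stated in the slides.
BARCODE_RE = re.compile(r"^([A-Z]{2,6}\d{5,10})")

def extract_barcode(file_path: str) -> Optional[str]:
    """Return the barcode at the start of the image file name, or None."""
    stem = PurePath(file_path).stem  # file name without directory or extension
    match = BARCODE_RE.match(stem)
    return match.group(1) if match else None
```

A file whose name yields no barcode returns `None`, so it can be set aside rather than linked to the wrong record.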

Page 6: Automated OCR

1. Iterate through “unprocessed” images
2. OCR via Tesseract (version 3)
   a) In focus, good lighting, minimal noise
   b) Resolution: >20px x-height
3. Store raw text block in database
4. Progress to next step
   a) Low OCR return => hand processing
   b) Natural Language Processing
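The loop above can be sketched as follows. This is a minimal illustration, not the project's code: the `MIN_CHARS` cutoff for a "low OCR return" is an assumed value, a plain dictionary stands in for the database, and `tesseract_cli` assumes a Tesseract build whose command-line tool accepts `stdout` as the output base.

```python
import subprocess
from typing import Callable, Dict, Iterable, Tuple

MIN_CHARS = 40  # assumed cutoff for a "low OCR return"; not from the slides

def tesseract_cli(image_path: str) -> str:
    """Run the Tesseract command-line tool and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def process_unprocessed(
    images: Iterable[str],
    ocr: Callable[[str], str] = tesseract_cli,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """OCR each image; return (raw text blocks, next step per image)."""
    raw_text: Dict[str, str] = {}   # stands in for the database table
    next_step: Dict[str, str] = {}
    for path in images:
        text = ocr(path).strip()
        raw_text[path] = text       # step 3: keep the raw text block
        # Count only letters and digits so noise characters do not inflate the score.
        usable = sum(ch.isalnum() for ch in text)
        next_step[path] = "nlp" if usable >= MIN_CHARS else "hand"
    return raw_text, next_step
```

Passing the OCR engine in as a callable keeps the routing logic independent of Tesseract, so another engine could be swapped in without changing the loop.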

Page 7: OCR Challenges

Issues:
Old fonts
Faded labels
Form labels
Handwritten labels
Specialized terms

Solutions:
Image treatments
OCR tuning
Dictionaries
Consensus OCR

¢_].L.|»‘¢ .'».f.'._..‘~,(.Jfin-x‘*\'a:"511z:1 wf .~\:'i/.onli State UniversityP.’~.r"~2= ,_. gg J:.2 " J*J*" †(=:\‘-“ax "»..'\-12�‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESXZ»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“» »4 xx, ,"""‘“â€T"’ <1;-.rs f3'a,1.z>.t;;a¢f~rus ’�V4 J 'if . r°'° M '1?nies ivain.) Sav.neutal Station - " '1 ~»r';;4-\P ` 1.T11 ./P.. ,J ..-.ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE_ ,. W5. (> f- , -:‘; i f>i_T ~~ . A 1:». v\ .-v »~. 4. a xvala 8/27/73

PLANTS OF NEW r~1ExIco
Herbarium of Arizona State University
Parmelia ulophyllodes (Vain.) Sav.
COUNTY “°â€â€œâ€œ �Joranada Experimental Station -New Mexico State University"“““' on Juniperus
ELEV. ‘ 4400
EEILLEETUR DATEDU T. H. Nash #7914 8/27/73
T. H. N.

Page 8: Automated NLP

1. Iterate through raw OCR text blocks
2. Parse text block
   a) Darwin Core
   b) Populate database
3. Review
   a) Adjust content
   b) Approve
   c) Handwritten => keystroke
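A rough sketch of the "parse text block" step is shown below. It pulls three values out of a raw OCR block with regular expressions and files them under Darwin Core term names (`verbatimEventDate`, `verbatimElevation`, `recordNumber`). The patterns and the function name `parse_block` are assumptions chosen to fit the sample label text in this deck; the actual parser is not described in the slides.

```python
import re
from typing import Dict

# Assumed patterns; lookarounds are used instead of \b so that text fused
# to a date by OCR (e.g. "8/27/73T.") does not block the match.
DATE_RE = re.compile(r"(?<!\d)(\d{1,2})/(\d{1,2})/(\d{2,4})(?!\d)")
ELEV_RE = re.compile(r"ELEV\.?\D{0,5}(\d{2,5})", re.IGNORECASE)
NUMBER_RE = re.compile(r"#\s*(\d+)")

def parse_block(text: str) -> Dict[str, str]:
    """Map what can be recognized in a raw OCR block onto Darwin Core terms."""
    record: Dict[str, str] = {}
    if m := DATE_RE.search(text):
        record["verbatimEventDate"] = m.group(0)
    if m := ELEV_RE.search(text):
        record["verbatimElevation"] = m.group(1)
    if m := NUMBER_RE.search(text):
        record["recordNumber"] = m.group(1)
    return record
```

Fields that are not found are simply left out, which leaves them for the review step rather than filling them with guesses.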

Page 9: NLP Challenges

Issues:
Variable layouts
Loose standards
OCR error

Solutions:
Authority tables
Levenshtein distance
Word stats
Format recognition
Parsing profiles
Duplicate harvesting
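As an illustration of how authority tables and Levenshtein distance can work together, the sketch below corrects an OCR-damaged term by picking the closest entry in an authority list. The threshold `max_distance` and the function names are assumptions; the slides do not say how the distance is applied in the project.

```python
from typing import Iterable, Optional

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def closest_authority(
    term: str, authority: Iterable[str], max_distance: int = 2
) -> Optional[str]:
    """Return the nearest authority-table entry within max_distance, else None."""
    best, best_dist = None, max_distance + 1
    for entry in authority:
        dist = levenshtein(term.lower(), entry.lower())
        if dist < best_dist:
            best, best_dist = entry, dist
    return best
```

For example, an OCR reading such as "Parmelia ulophyl1odes" (digit 1 for the letter l) is one edit away from "Parmelia ulophyllodes" and would be mapped to it, while a term far from every entry returns `None`.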

Page 10: NLP: Duplicate Harvesting

1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. High similarity indexes
4. OCR block comparison
5. Consensus record
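Steps 1, 2, and 5 can be sketched as below, assuming records are dictionaries keyed by Darwin Core terms and that a consensus value is simply the most common non-empty value per field. Both choices are assumptions for illustration; the similarity indexes and OCR block comparison in steps 3 and 4 are not covered here.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, str]

def duplicate_key(record: Record) -> Tuple[str, str, str]:
    """Collector last name, collector number, and date identify a collecting event."""
    names = record.get("recordedBy", "").split()
    # Takes the final word as the last name; suffixes such as "III" are not handled.
    last_name = names[-1].lower() if names else ""
    return (last_name, record.get("recordNumber", ""), record.get("eventDate", ""))

def harvest_duplicates(target: Record, consortium: List[Record]) -> List[Record]:
    """Return consortium records that share the target's collecting-event key."""
    key = duplicate_key(target)
    return [r for r in consortium if duplicate_key(r) == key]

def consensus_record(duplicates: List[Record]) -> Record:
    """For each field, keep the most common non-empty value across the duplicates."""
    consensus: Record = {}
    fields = {f for r in duplicates for f in r}
    for field in sorted(fields):
        values = [r[field] for r in duplicates if r.get(field)]
        if values:
            consensus[field] = Counter(values).most_common(1)[0][0]
    return consensus
```

Matching on the last name alone lets "Nash", "T. Nash", and "T. H. Nash" fall into the same collecting event, at the cost of occasional false matches that the review step would need to catch.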

Page 11: NLP: Targeted Parsing Profiles

1. Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Targeted parsing algorithms
4. Exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
   d) County
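One way to picture a targeted profile is the sketch below, which looks for the collector line format seen in the sample label earlier in the deck ("T. H. Nash #7914 8/27/73"). The regular expressions, the 20-character look-back window, and the cue words used for the exclusions are all assumptions, not the project's actual rules.

```python
import re
from typing import Dict, Optional

# Assumed collector-line shape: optional initials, "Nash", "#number", then a date.
COLLECTOR_RE = re.compile(
    r"((?:[A-Z]\.\s*){0,3}Nash)\s*#\s*(\d+)\s+(\d{1,2}/\d{1,2}/\d{2,4})"
)
# Assumed cues for a determiner or an associated collector just before the name.
EXCLUDE_PREFIX_RE = re.compile(r"(?:\bdet\.?|\bdetermined by|\bwith|&)\s*$", re.IGNORECASE)

def parse_nash_label(text: str) -> Optional[Dict[str, str]]:
    """Apply the 'Nash' profile to a raw OCR block; None if it does not fit."""
    if "Nash" not in text:  # cheap filter on the raw OCR text
        return None
    for m in COLLECTOR_RE.finditer(text):
        prefix = text[max(0, m.start() - 20):m.start()]
        if EXCLUDE_PREFIX_RE.search(prefix):
            continue  # Nash appears as determiner or associated collector
        return {
            "recordedBy": m.group(1),
            "recordNumber": m.group(2),
            "verbatimEventDate": m.group(3),
        }
    # "Nash" as a county or as the author of a scientific name lacks the
    # "#number date" tail, so those occurrences never match COLLECTOR_RE.
    return None
```

Blocks that mention Nash only in an excluded role return `None` and would fall through to a more general parser.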

Page 12: Label Review

Page 13: Thank You

Michael Adamo, Bruce Allen, Meredith Blackwell, Bill Buck, Alina Freire-Fierro, John Freudenstein, Alan Fryday, David Giblin, Karen Hughes, Steffi Ickert-Bond, Timothy James, Jennifer S. Kluse, Matt Von Konrat, Ben Legler, Tatyana Livshultz

Robert Lücking, Francois Lutzoni, Bob Magill, Andrew Miller, Brent Mishler, Donald Pfister, Richard Rabeler, Malcolm Sargent, Edward Schilling, Michaela Schmull, Blanka Shaw, Jon Shaw, Carol Shearer, Larry StClair, Barbara Thiers

Funded by the NSF ADBC program