Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens

Post on 09-Feb-2016



Transcript of Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens

Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens

Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin

Goals and Scope

NSF ADBC (#1115116)

~2.3 million specimens

90% of all specimens

900,000 lichens

1.4 million bryophytes

> 60 non-governmental US herbaria (95%)

Mexico, US, Canada

16 digitization centers

Digitization Workflow

National Portals

Lichen Consortium: http://lichenportal.org (34 collections, 902,664 records)

Bryophyte Consortium: http://bryophyteportal/ (26 collections, 1,300,135 records)

Symbiota software

Imaging Stage

Capture Image: barcode in file name

Create Skeleton File: species name, country, state, exsiccati, etc.

Upload to FTP server

Image processing: extract barcode, create web versions, map to portal DBs

Herbarium Database

Automated OCR: Tesseract, ABBYY

Existing Record: simply link image

Upload to FTP server

Image URLs

Manage Specimen Data in Portal

Manage / Review Records in Portal

Symbiota Editor: review, edit, keystroke

Create New Record: barcode, image, skeletal data

Automated NLP: Darwin Core Parsing
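The barcode step of the workflow above (barcode in the file name, then either link to an existing record or create a new one) could be sketched roughly as follows. The barcode pattern, the function names, and the routing labels are all assumptions made for illustration; the slides do not specify how Symbiota implements this.

```python
import re
from pathlib import Path
from typing import Optional, Set, Tuple

# Hypothetical barcode pattern: 2-6 letters followed by 5-9 digits,
# e.g. "ASU0012345". Each collection would define its own pattern.
BARCODE_RE = re.compile(r"([A-Z]{2,6}\d{5,9})", re.IGNORECASE)


def barcode_from_filename(filename: str) -> Optional[str]:
    """Return the barcode embedded in an image file name, or None."""
    match = BARCODE_RE.search(Path(filename).stem)
    return match.group(1).upper() if match else None


def route_image(filename: str, existing_barcodes: Set[str]) -> Tuple[str, Optional[str]]:
    """Decide what to do with an uploaded image (labels are illustrative)."""
    barcode = barcode_from_filename(filename)
    if barcode is None:
        return ("unmatched", None)
    if barcode in existing_barcodes:
        # Existing record: simply link the image.
        return ("link_to_existing_record", barcode)
    # No record yet: create one from barcode, image, and skeletal data.
    return ("create_new_record", barcode)
```

Keeping the routing decision in one small function makes it easy to swap in a collection-specific barcode pattern without touching the rest of the pipeline.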

Automated OCR

1. Iterate through “unprocessed” images

2. OCR via Tesseract (version 3)
a) In focus, good lighting, minimal noise
b) Resolution: >20px x-height

3. Save raw text block to database

4. Progress to next step
a) Low OCR return => hand processing
b) Natural Language Processing
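The loop described above might look something like this minimal sketch. The `ImageRecord` class, the status strings, and the 40-character threshold for a "low OCR return" are assumptions, not details from the project. The OCR engine is passed in as a function so the sketch runs without Tesseract installed; in practice it could wrap a call such as `pytesseract.image_to_string(Image.open(path))`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical cutoff: fewer alphanumeric characters than this counts
# as a "low OCR return" and is routed to hand processing.
MIN_USEFUL_CHARS = 40


@dataclass
class ImageRecord:
    path: str
    status: str = "unprocessed"
    raw_text: Optional[str] = None


def run_ocr_batch(
    records: List[ImageRecord],
    ocr: Callable[[str], str],
    min_useful_chars: int = MIN_USEFUL_CHARS,
) -> List[ImageRecord]:
    """OCR every unprocessed image and decide its next step."""
    for record in records:
        if record.status != "unprocessed":
            continue  # already handled in an earlier run
        text = ocr(record.path).strip()
        record.raw_text = text  # stands in for saving the raw text block
        useful = sum(ch.isalnum() for ch in text)
        record.status = (
            "ready_for_nlp" if useful >= min_useful_chars else "hand_processing"
        )
    return records
```

Counting only alphanumeric characters means a block of punctuation noise does not pass as a successful OCR return.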

OCR Challenges

Issues: old fonts, faded labels, form labels, handwritten labels, specialized terms

Solutions: image treatments, OCR tuning, dictionaries, consensus OCR
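One simple way to combine the "dictionaries" and "consensus OCR" ideas is to score each engine's output by how many of its words appear in a domain vocabulary, then keep the best-scoring text. This is a deliberately simpler stand-in for true consensus OCR, which would merge outputs rather than pick one. The function names and the vocabulary are assumptions for illustration.

```python
import re
from typing import Dict, Iterable, Tuple


def dictionary_score(text: str, vocabulary: Iterable[str]) -> float:
    """Fraction of alphabetic tokens found in the vocabulary (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    vocab = {word.lower() for word in vocabulary}
    return sum(token.lower() in vocab for token in tokens) / len(tokens)


def pick_best_output(
    outputs: Dict[str, str], vocabulary: Iterable[str]
) -> Tuple[str, float]:
    """Return (engine name, score) for the output with the highest score."""
    vocab = list(vocabulary)
    scored = {name: dictionary_score(text, vocab) for name, text in outputs.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

A vocabulary built from taxon names, place names, and common label words would let the same score flag labels whose best output is still too poor for automated parsing.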

¢_].L.|»‘¢ .'».f.'._..‘~,(.Jfin-x‘*\'a:"511z:1 wf .~\:'i/.onli State UniversityP.’~.r"~2= ,_. gg J:.2 " J*J*" †(=:\‘-“ax "»..'\-12�‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESXZ»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“» »4 xx, ,"""‘“â€T"’ <1;-.rs f3'a,1.z>.t;;a¢f~rus ’�V4 J 'if . r°'° M '1?nies ivain.) Sav.neutal Station - " '1 ~»r';;4-\P ` 1.T11 ./P.. ,J ..-.ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE_ ,. W5. (> f- , -:‘; i f>i_T ~~ . A 1:». v\ .-v »~. 4. a xvala 8/27/73

PLANTS OF NEW r~1ExIco
Herbarium of Arizona State University
Parmelia ulophyllodes (Vain.) Sav.
COUNTY “°â€â€œâ€œ �Joranada Experimental Station -New Mexico State University"“““' on Juniperus
ELEV. ‘ 4400
EEILLEETUR DATEDU T. H. Nash #7914 8/27/73
T. H. N.

Automated NLP

1. Iterate through raw OCR text blocks

2. Parse text block
a) Darwin Core
b) Populate database

3. Review
a) Adjust content
b) Approve
c) Handwritten => keystroke
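As a rough illustration of the parsing step, the sketch below pulls a few Darwin Core fields out of a raw OCR block with regular expressions. The patterns are assumptions tuned to the sample label shown earlier (collector with initials, "#" before the collection number, slash-separated date); they are not the project's actual parser. The keys are standard Darwin Core term names.

```python
import re
from typing import Dict

# Collector written as initials plus surname, followed by "#number".
COLLECTOR_RE = re.compile(r"((?:[A-Z]\.\s*)+[A-Z][a-z]+)\s*#\s*(\d+)")
# Slash-separated date not embedded in a longer digit run.
DATE_RE = re.compile(r"(?<!\d)(\d{1,2}/\d{1,2}/\d{2,4})(?!\d)")
# "ELEV" followed, within a few stray characters, by a number.
ELEV_RE = re.compile(r"ELEV\.?\D{0,5}(\d{2,5})", re.IGNORECASE)


def parse_label(raw_text: str) -> Dict[str, str]:
    """Map whatever can be recognized onto Darwin Core terms."""
    record: Dict[str, str] = {}
    collector = COLLECTOR_RE.search(raw_text)
    if collector:
        record["recordedBy"] = collector.group(1)
        record["recordNumber"] = collector.group(2)
    date = DATE_RE.search(raw_text)
    if date:
        record["verbatimEventDate"] = date.group(1)
    elevation = ELEV_RE.search(raw_text)
    if elevation:
        record["verbatimElevation"] = elevation.group(1)
    return record
```

Fields that cannot be recognized are simply left out, so the reviewer sees a partially filled record rather than a guess.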

NLP Challenges

Issues: variable layouts, loose standards, OCR error

Solutions: authority tables, Levenshtein distance, word stats, format recognition, parsing profiles, duplicate harvesting
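To show how authority tables and Levenshtein distance can work together, here is a small sketch that snaps an OCR-damaged name to the closest entry in an authority list. The two-edit cutoff and the function names are assumptions, and the authority list in the usage note is illustrative only.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def match_authority(
    candidate: str, authority: Iterable[str], max_distance: int = 2
) -> Optional[str]:
    """Return the closest authority entry within max_distance edits, else None."""
    best_name, best_distance = None, max_distance + 1
    for name in authority:
        distance = levenshtein(candidate.lower(), name.lower())
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name
```

For example, `match_authority("Parmelia ulophylodes", ["Parmelia ulophyllodes", "Parmelia sulcata"])` resolves the misspelled epithet to the first entry, while a name far from every entry returns `None` and is left for review.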

NLP: Duplicate Harvesting

1. Extract collector data
a) Last name, number, date

2. Harvest duplicates from consortium DB
a) Exact duplicates
b) Duplicate events

3. High similarity indexes

4. OCR block comparison

5. Consensus record
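A minimal sketch of the harvesting idea: group records by collector last name, number, and date, then build a consensus record by majority vote on each field. The record layout (plain dictionaries with `last_name`, `number`, `date` keys) and the voting rule are assumptions; the similarity index and OCR block comparison steps are not covered here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def duplicate_key(record: dict, include_number: bool = True) -> Tuple[str, ...]:
    """Key for exact duplicates; drop the number to find duplicate events."""
    key = [record["last_name"].strip().lower(), record["date"].strip()]
    if include_number:
        key.append(str(record["number"]).strip())
    return tuple(key)


def group_duplicates(
    records: Iterable[dict], include_number: bool = True
) -> Dict[Tuple[str, ...], List[dict]]:
    """Return only the groups that contain more than one record."""
    groups: Dict[Tuple[str, ...], List[dict]] = defaultdict(list)
    for record in records:
        groups[duplicate_key(record, include_number)].append(record)
    return {key: members for key, members in groups.items() if len(members) > 1}


def consensus_record(members: List[dict], fields: Iterable[str]) -> dict:
    """For each field, keep the most common non-empty value across duplicates."""
    consensus = {}
    for field in fields:
        values = [m[field] for m in members if m.get(field)]
        if values:
            consensus[field] = Counter(values).most_common(1)[0][0]
    return consensus
```

Majority voting lets a single OCR error in one duplicate be outvoted by the other copies, which is the point of harvesting duplicates across the consortium database.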

NLP: Targeted Parsing Profiles

1. Target similar label formats

2. Use raw OCR to locate “Nash” labels

3. Targeted parsing algorithms

4. Exclude:
a) Determined by Nash
b) Author of scientific name
c) Associated collector
d) County
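The locate-and-exclude logic above could be approximated as follows: strip every context in which "Nash" is not the primary collector, then check whether the name still occurs. All four patterns are rough assumptions about how such contexts appear on labels, and the function name is hypothetical; a real profile would be tuned against actual label formats.

```python
import re

INITIALS = r"(?:[A-Z]\.\s*)*"

# Contexts where "Nash" should NOT count as the primary collector.
EXCLUDE_PATTERNS = [
    # a) Determined by Nash: "Det. T. H. Nash", "determined by Nash"
    re.compile(r"(?i:\bdet(?:ermined)?\.?(?:\s+by)?\s+)" + INITIALS + r"Nash\b"),
    # b) Author of scientific name: "Genus species Nash" or "(Nash)"
    re.compile(r"\b[A-Z][a-z]+\s+[a-z]+\s+\(?" + INITIALS + r"Nash\b\)?"),
    # c) Associated collector: "with Nash", "and Nash", "& T. H. Nash"
    re.compile(r"(?i:\b(?:with|and)\s+|&\s*)" + INITIALS + r"Nash\b"),
    # d) County: "Nash County", "Nash Co."
    re.compile(r"\bNash\s+(?:County|Co\.)"),
]


def is_nash_collector_label(raw_text: str) -> bool:
    """True if 'Nash' still appears after removing the excluded contexts."""
    remaining = raw_text
    for pattern in EXCLUDE_PATTERNS:
        remaining = pattern.sub(" ", remaining)
    return re.search(r"\bNash\b", remaining) is not None
```

Labels that pass this filter would then go to the parsing algorithm written for that collector's label format, while the rest fall back to general parsing.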

Label Review

Thank You

Michael Adamo Bruce Allen Meredith Blackwell Bill Buck Alina Freire-Fierro John Freudenstein Alan Fryday David Giblin Karen Hughes Steffi Ickert-Bond Timothy James Jennifer S. Kluse Matt Von Konrat Ben Legler Tatyana Livshultz

Robert Lücking Francois Lutzoni Bob Magill Andrew Miller Brent Mishler Donald Pfister Richard Rabeler Malcolm Sargent Edward Schilling Michaela Schmull Blanka Shaw Jon Shaw Carol Shearer Larry StClair Barbara Thiers

Funded by the NSF ADBC program