Oboyski cal bug_ecn_2012

Digitizing California Arthropod Collections

Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie

Essig Museum of Entomology, University of California, Berkeley, California, USA

What is CalBug?

Essig Museum of Entomology

California Academy of Sciences

California State Collection of Arthropods

Bohart Museum, UC Davis

Entomology Research Museum, UC Riverside

San Diego Natural History Museum

LA County Museum

Santa Barbara Museum of Natural History

http://www.calacademy.org/

Digitization workflow

Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection

Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing

Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses

Why Image Specimens/Labels?

• Data capture can be done remotely
• Magnify difficult-to-read labels
• Verbatim archive of label data

Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection

Presorting allows faster databasing

Removing labels is quick

Adding unique identifiers is slow

Efficient work station, file naming conventions and batch processing

Replacing labels takes time

1st generation - DinoLite digital microscope

2nd generation – Digital Camera (Canon G9)

High resolution - magnify hard to read labels

Labels flat, unobscured - better for OCR

Scale bar, controlled light

Important to add species name to image or file name

Digital camera, tethered to computer; labels removed

EMEC218958 Paracotalpa ursina.jpg

Scanning Slides

Flatbed scanner & Photoshop

Save for Web & Devices

IrfanView software for batch processing of image files

EMEC218958 Paracotalpa ursina.jpg

Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing

Using our own MySQL database (EssigDB)
• Built-in error checking
• Data carry-over from one record to the next
• Taxonomy automatically added

“Notes from Nature”
• Collaboration with Zooniverse
• Citizen Scientist transcription of labels

Collaboration with UC San Diego
• Improved OCR and “word spotting”
• Automatic data parsing (not yet!!)
• iDigBio “hackathon” in February for OCR

Genus and species from file name

Higher taxonomy auto-filled from database authority file
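The file-name convention and the authority-file lookup can be sketched in a few lines of Python. This is only an illustration, not the EssigDB code: the regular expression assumes every name looks like the example "EMEC218958 Paracotalpa ursina.jpg", and the `AUTHORITY` table and the `parse_image_name` function are hypothetical stand-ins for the database authority file.

```python
import re
from pathlib import Path

# Hypothetical stand-in for the database authority file: genus -> (order, family).
AUTHORITY = {"Paracotalpa": ("Coleoptera", "Scarabaeidae")}

# Assumed naming pattern: catalog number, space, genus, space, species epithet.
NAME_PATTERN = re.compile(
    r"^(?P<catalog>[A-Z]+\d+) (?P<genus>[A-Z][a-z]+) (?P<species>[a-z]+)$"
)


def parse_image_name(filename):
    """Split an image file name into catalog number, genus and species,
    then fill in higher taxonomy from the authority table when available."""
    stem = Path(filename).stem  # drop the ".jpg" extension
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename!r}")
    record = match.groupdict()
    # Unknown genera get None so they can be flagged for manual review.
    record["order"], record["family"] = AUTHORITY.get(record["genus"], (None, None))
    return record


print(parse_image_name("EMEC218958 Paracotalpa ursina.jpg"))
```

Names that do not match the pattern raise an error instead of being guessed at, so badly named files surface before any record is created.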

Notes from Nature

Citizen Science data transcription

Integrating OCR with crowd sourcing

o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words”
   o e.g. species, states, counties
o Providing an additional “vote” for transcription consensus
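One way to picture OCR as an additional "vote" is a simple majority count over the volunteer transcriptions of a single field. The sketch below is an assumption about how such a consensus could work, not the Notes from Nature implementation; the `consensus` function, the normalization (trim and lowercase) and the `min_share` threshold are all hypothetical.

```python
from collections import Counter


def consensus(transcriptions, ocr_guess=None, min_share=0.5):
    """Return the value that more than `min_share` of the votes agree on,
    or None when no value reaches that share."""
    # Normalize lightly so "Alameda" and "alameda " count as the same vote.
    votes = [t.strip().lower() for t in transcriptions if t and t.strip()]
    if ocr_guess and ocr_guess.strip():
        votes.append(ocr_guess.strip().lower())  # OCR counts as one extra vote
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None


# Three volunteers plus the OCR reading of the same county field.
print(consensus(["Alameda", "alameda ", "Alamada"], ocr_guess="Alameda"))
```

Fields that return `None` would stay in the queue for more transcriptions or expert review.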

The OCR challenge for specimen labels

DETECTION:
• Finding text in a complex matrix
• Machine-typed vs. hand-written labels
• Sliding window classifier creating text bounding boxes
• >95% detection and localization using pixel-overlap measures
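The slides do not say which pixel-overlap measure was used. A common choice, assumed here purely for illustration, is intersection over union (IoU) between a detected text box and a hand-drawn reference box. The `overlap_ratio` function and the (x0, y0, x1, y1) box format are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x0, y0, x1, y1) pixel coordinates with x0 < x1 and y0 < y1."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union else 0.0


# A detection shifted halfway off the reference box shares one third of the union.
print(overlap_ratio((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection would then count as correct when the ratio exceeds some chosen threshold; the threshold itself is not given in the source.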

RECOGNITION:

Using Tesseract OCR engine

Machine type
• 74% accuracy at word level
• 82% accuracy at character level

Handwriting
• 5.4% accuracy at word level
• 9.2% accuracy at character level
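Word-level accuracy is usually lower than character-level accuracy because one wrong character spoils a whole word. The sketch below shows one common way to compute both figures from a reference transcription, using edit distance. The exact definitions behind the percentages above are not given in the slides, so these formulas and function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (item_a != item_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), floored at zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))


def character_accuracy(reference_text, ocr_text):
    return accuracy(reference_text, ocr_text)


def word_accuracy(reference_text, ocr_text):
    return accuracy(reference_text.split(), ocr_text.split())


truth, ocr = "Marin Co. Calif", "Marin Co. Ca1if"  # OCR read "l" as "1"
print(character_accuracy(truth, ocr), word_accuracy(truth, ocr))
```

In this toy label a single misread character costs 1 of 15 characters but 1 of 3 words.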

Current Progress in OCR recognition

Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses

Just starting this phase

No report on error rates

Georeferencing is very slow, even when semi-automated with GEOLocate and other services

Following Darwin Core standards

Merging of data is straightforward

Analyses pending
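Merging is straightforward under Darwin Core because each collection only has to map its own column names onto the shared terms. A minimal sketch follows, assuming hypothetical local field names (`museum`, `specimen_id`, and so on). The Darwin Core terms on the right-hand side are standard, but the mapping and the sample record are illustrative, not taken from EssigDB.

```python
# Hypothetical local column names mapped to standard Darwin Core terms.
FIELD_MAP = {
    "museum": "institutionCode",
    "specimen_id": "catalogNumber",
    "genus": "genus",
    "species": "specificEpithet",
    "state": "stateProvince",
    "county": "county",
    "collected": "eventDate",
}


def to_darwin_core(record, field_map=FIELD_MAP):
    """Rename the fields of one local record to Darwin Core terms,
    skipping fields the record does not have."""
    dwc = {term: record[field] for field, term in field_map.items() if field in record}
    # Build scientificName from its parts when both are present.
    if "genus" in dwc and "specificEpithet" in dwc:
        dwc["scientificName"] = f"{dwc['genus']} {dwc['specificEpithet']}"
    return dwc


local_record = {
    "museum": "EMEC",
    "specimen_id": "EMEC218958",
    "genus": "Paracotalpa",
    "species": "ursina",
}
print(to_darwin_core(local_record))
```

Once every museum exports records in this shape, aggregating them in a shared cache amounts to concatenating lists of dictionaries with the same keys.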

Progress
• After 2 years ...
• Undergraduate student work force
• Pinned specimens
   – Imaging 20-65 specimens per hour (ave. = 40)
• Microscope slides
   – Imaging 100-170 specimens per hour (ave. = 140)
• Approximately 40,000 records databased
   – Plus 115,000 previously databased insect records
• 150,000+ images waiting to be databased

Thank you

http://calbug.berkeley.edu
