Oboyski cal bug_ecn_2012

Digitizing California Arthropod Collections Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie Essig Museum of Entomology University of California Berkeley, California, USA

Transcript of Oboyski cal bug_ecn_2012

Page 1: Oboyski cal bug_ecn_2012

Digitizing California Arthropod Collections

Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie

Essig Museum of Entomology

University of California

Berkeley, California, USA

Page 2: Oboyski cal bug_ecn_2012

What is CalBug?

Essig Museum of Entomology

California Academy of Sciences

California State Collection of Arthropods

Bohart Museum, UC Davis

Entomology Research Museum, UC Riverside

San Diego Natural History Museum

LA County Museum

Santa Barbara Museum of Natural History

Page 3: Oboyski cal bug_ecn_2012
Page 4: Oboyski cal bug_ecn_2012

Digitization workflow

Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection

Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing

Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses

Page 5: Oboyski cal bug_ecn_2012

Why Image Specimens/Labels?

• Data capture can be done remotely
• Magnify difficult-to-read labels
• Verbatim archive of label data

Page 6: Oboyski cal bug_ecn_2012

Handling & Imaging

• (Optional) Sort by locality, date, sex, etc.
  – Presorting allows faster databasing
• Remove labels, add unique identifier
  – Removing labels is quick
  – Adding unique identifiers is slow
• Take digital image, name and save file
  – Efficient work station, file naming conventions and batch processing
• Replace labels, return to collection
  – Replacing labels takes time

Page 7: Oboyski cal bug_ecn_2012

1st generation - DinoLite digital microscope

Page 8: Oboyski cal bug_ecn_2012
Page 9: Oboyski cal bug_ecn_2012

2nd generation – Digital Camera (Canon G9)

Page 10: Oboyski cal bug_ecn_2012

High resolution - magnify hard to read labels

Labels flat, unobscured - better for OCR

Scale bar, controlled light

Important to add species name to image or file name

Digital camera

Tethered to computer

Labels removed

EMEC218958 Paracotalpa ursina.jpg
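
The file name above carries a catalog number and a species name in one string. As a minimal sketch, assuming every image follows the pattern `<catalog number> <genus> <species>.<extension>` (the pattern, the function names, and the `ImageName` type are illustrative assumptions, not the project's actual code), such names could be built and checked like this:

```python
import re
from typing import NamedTuple

# Assumed pattern: "<LETTERS+digits> <Genus> <species>.<ext>"
FILENAME_RE = re.compile(
    r"^(?P<catalog>[A-Z]+\d+) (?P<genus>[A-Z][a-z]+) "
    r"(?P<species>[a-z-]+)\.(?P<ext>\w+)$"
)


class ImageName(NamedTuple):
    """Parts recovered from an image file name (hypothetical type)."""
    catalog: str
    genus: str
    species: str


def build_image_name(catalog: str, genus: str, species: str, ext: str = "jpg") -> str:
    """Compose a file name from the unique identifier and the species name."""
    return f"{catalog} {genus} {species}.{ext}"


def parse_image_name(filename: str) -> ImageName:
    """Split a file name back into its parts; reject names that break the pattern."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return ImageName(match["catalog"], match["genus"], match["species"])
```

Rejecting names that do not match, rather than guessing, keeps misnamed files out of later batch steps.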

Page 11: Oboyski cal bug_ecn_2012

Scanning Slides

Flatbed scanner & Photoshop

Page 12: Oboyski cal bug_ecn_2012

Save for Web & Devices

Page 13: Oboyski cal bug_ecn_2012

IrfanView software for batch processing of image files

EMEC218958 Paracotalpa ursina.jpg

Page 14: Oboyski cal bug_ecn_2012

Data capture

• Manually enter data into MySQL database
  – Using our own MySQL database (EssigDB)
  – Built-in error checking
  – Data carry-over from one record to the next
  – Taxonomy automatically added
• Online crowd-sourcing of manual data entry
  – “Notes from Nature”
  – Collaboration with Zooniverse
  – Citizen Scientist transcription of labels
• Optical Character Recognition (OCR) & automated data parsing
  – Collaboration with UC San Diego
  – Improved OCR and “word spotting”
  – Automatic data parsing (not yet!!)
  – iDigBio “hackathon” in February for OCR

Page 15: Oboyski cal bug_ecn_2012
Page 16: Oboyski cal bug_ecn_2012

Genus and species from file name

Higher taxonomy auto-filled from database authority file
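
A rough sketch of the idea, not EssigDB's actual implementation: the genus and species are read from the file name, and the higher taxonomy is looked up by genus. The `AUTHORITY` dictionary is a hypothetical stand-in for the database authority file, and the field names are assumptions.

```python
# Hypothetical stand-in for the database authority file, keyed by genus.
AUTHORITY = {
    "Paracotalpa": {"order": "Coleoptera", "family": "Scarabaeidae"},
}


def fill_taxonomy(filename: str) -> dict:
    """Build a record from '<catalog> <genus> <species>.<ext>' plus authority data."""
    stem = filename.rsplit(".", 1)[0]           # drop the extension
    catalog, genus, species = stem.split(" ", 2)
    record = {"catalog_number": catalog, "genus": genus, "species": species}
    # A genus missing from the authority table leaves the higher taxonomy blank.
    record.update(AUTHORITY.get(genus, {}))
    return record


record = fill_taxonomy("EMEC218958 Paracotalpa ursina.jpg")
print(record["family"], record["order"])
```

Keying the lookup on genus means a single authority entry serves every species in that genus.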

Page 17: Oboyski cal bug_ecn_2012

Notes from Nature

Citizen Science data transcription

Page 18: Oboyski cal bug_ecn_2012
Page 19: Oboyski cal bug_ecn_2012

Integrating OCR with crowd sourcing

• Spotting words within images
• Copy-paste, highlight-drag fields
• Auto-detecting repeated “words”
  – e.g. species, states, counties
• Providing an additional “vote” for transcription consensus
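
One way the OCR "vote" might work, shown as a sketch under stated assumptions: each volunteer transcription of a field counts as one vote, the OCR reading counts as one more, and a value is accepted only if it holds a strict majority. The function names, the normalization, and the majority threshold are illustrative choices, not the Notes from Nature implementation.

```python
from collections import Counter
from typing import Iterable, Optional


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not split votes."""
    return " ".join(text.split()).lower()


def consensus(
    transcriptions: Iterable[str],
    ocr_text: Optional[str] = None,
    min_share: float = 0.5,
) -> Optional[str]:
    """Return the value backed by more than min_share of votes, else None."""
    votes = [normalize(t) for t in transcriptions if t and t.strip()]
    if ocr_text and ocr_text.strip():
        votes.append(normalize(ocr_text))  # OCR counts as one additional vote
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None
```

With two volunteers who disagree, no value reaches a majority; an OCR reading that matches one of them breaks the tie.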

Page 20: Oboyski cal bug_ecn_2012

The OCR challenge for specimen labels

DETECTION:
• Finding text in a complex matrix
• Machine-typed vs. hand-written labels
• Sliding window classifier creating text bounding boxes
• >95% detection and localization using pixel-overlap measures
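
The slide does not say which pixel-overlap measure was used. A common choice is intersection over union of the predicted and the hand-annotated bounding box, sketched here as an assumption; the box format `(x0, y0, x1, y1)` and the 0.5 threshold are also assumptions.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1 > x0, y1 > y0


def box_area(box: Box) -> int:
    """Area in pixels; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, from 0.0 (disjoint) to 1.0 (identical)."""
    intersection = box_area(
        (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    )
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union else 0.0


def is_detected(predicted: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a text region as found when the overlap reaches the threshold."""
    return overlap_ratio(predicted, truth) >= threshold
```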

Page 21: Oboyski cal bug_ecn_2012

Current Progress in OCR recognition

RECOGNITION:

Using Tesseract OCR engine

Machine Type
• 74% accuracy for word-level
• 82% accuracy for character-level

Hand Writing
• 5.4% accuracy for word-level
• 9.2% accuracy for character-level
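
The slides do not define how word-level and character-level accuracy were scored. One common definition, sketched here as an assumption, is one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length, counted in characters or in whole words.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance; works on strings (characters) or lists (words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution
            ))
        previous = current
    return previous[-1]


def accuracy(truth: Sequence, ocr: Sequence) -> float:
    """1 - distance / reference length, floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def char_accuracy(truth: str, ocr: str) -> float:
    return accuracy(truth, ocr)


def word_accuracy(truth: str, ocr: str) -> float:
    return accuracy(truth.split(), ocr.split())
```

Under this definition a single wrong character costs one character but a whole word, which is consistent with word-level figures being lower than character-level ones.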

Page 22: Oboyski cal bug_ecn_2012

Data Manipulation

Just starting this phase

• Error checking
  – No report on error rates
• Geographic referencing
  – Georeferencing very slow, even with semi-automation using GeoLocate and other services
• Aggregate data in online cache
  – Following Darwin Core standards
  – Merging of data straightforward
• Temporospatial analyses
  – Analyses pending
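
To illustrate why a shared standard makes merging straightforward, here is a minimal sketch that renames local database columns to Darwin Core terms and then combines records from several collections. The Darwin Core term names are real; the local column names, the function names, and the use of "EMEC" as an institution code are assumptions for illustration.

```python
# Hypothetical local column names mapped to Darwin Core terms.
FIELD_MAP = {
    "catalog_number": "catalogNumber",
    "genus": "genus",
    "species": "specificEpithet",
    "state": "stateProvince",
    "county": "county",
    "collection_date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_darwin_core(record: dict, institution_code: str) -> dict:
    """Rename local fields to Darwin Core terms, skipping empty values."""
    dwc = {"institutionCode": institution_code}
    for local_name, term in FIELD_MAP.items():
        if record.get(local_name) not in (None, ""):
            dwc[term] = record[local_name]
    if "genus" in dwc and "specificEpithet" in dwc:
        dwc["scientificName"] = f"{dwc['genus']} {dwc['specificEpithet']}"
    return dwc


def merge(*collections: list) -> list:
    """Combine Darwin Core records, keeping one per institution + catalog number."""
    merged = {}
    for records in collections:
        for rec in records:
            merged.setdefault((rec["institutionCode"], rec["catalogNumber"]), rec)
    return list(merged.values())
```

Once every museum exports the same term names, merging reduces to concatenation plus de-duplication on the institution code and catalog number.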

Page 23: Oboyski cal bug_ecn_2012

Progress

• After 2 years ...
• Undergraduate student work force
• Pinned specimens
  – Imaging 20-65 specimens per hour (ave. = 40)
• Microscope slides
  – Imaging 100-170 specimens per hour (ave. = 140)
• Approximately 40,000 records databased
  – Plus 115,000 previously databased insect records
• 150,000+ images waiting to be databased

Page 24: Oboyski cal bug_ecn_2012

Thank you

http://calbug.berkeley.edu