Oboyski cal bug_ecn_2012
Digitizing California Arthropod Collections
Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie
Essig Museum of Entomology
University of California
Berkeley, California, USA
What is CalBug?
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
LA County Museum
Santa Barbara Museum of Natural History
Digitization workflow
Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection
Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing
Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses
Why Image Specimens/Labels?
• Data capture can be done remotely
• Magnify difficult to read labels
• Verbatim archive of label data
Handling & Imaging
(Optional) Sort by locality, date, sex, etc.
Remove labels, add unique identifier
Take digital image, name and save file
Replace labels, return to collection
Presorting allows faster databasing
Removing labels is quick
Adding unique identifiers is slow
Efficient work station, file naming conventions and batch processing
Replacing labels takes time
1st generation - DinoLite digital microscope
2nd generation – Digital Camera (Canon G9)
High resolution - magnify hard to read labels
Labels flat, unobscured - better for OCR
Scale bar, controlled light
Important to add species name to image or file name
Digital camera, tethered to computer, labels removed
EMEC218958 Paracotalpa ursina.jpg
Scanning Slides
Flatbed scanner & Photoshop
Save for Web & Devices
IrfanView software for batch processing of image files
EMEC218958 Paracotalpa ursina.jpg
Data Capture
Manually enter data into MySQL database
Online crowd-sourcing of manual data entry
Optical Character Recognition (OCR) & automated data parsing
Using our own MySQL database (EssigDB)
Built-in error checking
Data carry-over from one record to the next
Taxonomy automatically added
“Notes from Nature”
Collaboration with Zooniverse
Citizen Scientist transcription of labels
Collaboration with UC San Diego
Improved OCR and “word spotting”
Automatic data parsing (not yet!!)
- iDigBio “hackathon” in February for OCR
Genus and species from file name
Higher taxonomy auto-filled from database authority file
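The file-name convention shown earlier (EMEC218958 Paracotalpa ursina.jpg) can be turned into a partial record along these lines. This is only a minimal sketch: the regular expression, the `AUTHORITY` dictionary, and the function name are assumptions made for illustration, not the actual EssigDB implementation, and the authority entry is a single hand-typed example.

```python
import re
from pathlib import Path

# Hypothetical stand-in for the database authority file:
# genus -> higher taxonomy. Only one illustrative entry is included.
AUTHORITY = {
    "Paracotalpa": {"order": "Coleoptera", "family": "Scarabaeidae"},
}

# Assumed convention: "<catalog number> <Genus> <species>.<extension>"
FILENAME_PATTERN = re.compile(
    r"^(?P<catalog>[A-Z]+\d+)\s+(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z]+)$"
)


def record_from_filename(filename):
    """Build a partial specimen record from an image file name."""
    stem = Path(filename).stem  # drop the ".jpg" extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename!r}")
    record = match.groupdict()
    # Auto-fill higher taxonomy when the genus is in the authority file.
    record.update(AUTHORITY.get(record["genus"], {}))
    return record


print(record_from_filename("EMEC218958 Paracotalpa ursina.jpg"))
```

A file whose name does not match the convention raises an error rather than producing a half-filled record, which keeps naming mistakes visible at import time.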
Notes from Nature
Citizen Science data transcription
Integrating OCR with crowd sourcing
o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words”
  o e.g. species, states, counties
o Providing an additional “vote” for transcription consensus
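One way to picture OCR acting as an additional “vote” is a simple majority count over the transcriptions of a single field. The sketch below is an assumption about how such a consensus could work, not a description of the Notes from Nature logic; the function name, the normalization, and the `min_share` threshold are all hypothetical.

```python
from collections import Counter


def consensus(transcriptions, ocr_guess=None, min_share=0.5):
    """Return the value that wins a strict majority of votes, or None.

    transcriptions: strings entered by volunteers for one label field.
    ocr_guess: optional OCR reading, counted as one extra vote (assumption).
    """
    # Normalize lightly so "Alameda" and "alameda " count as the same vote.
    votes = [t.strip().lower() for t in transcriptions if t and t.strip()]
    if ocr_guess and ocr_guess.strip():
        votes.append(ocr_guess.strip().lower())
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None


# Two volunteers disagree; the OCR reading breaks the tie.
print(consensus(["Alameda", "Alamada"]))                       # no majority
print(consensus(["Alameda", "Alamada"], ocr_guess="Alameda"))  # majority
```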
The OCR challenge for specimen labels
DETECTION:
Finding text in a complex matrix
Machine-typed vs. hand-written labels
Sliding window classifier creating text bounding boxes
>95% detection and localization using pixel-overlap measures
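The slides do not say which pixel-overlap measure was used. A common choice for scoring a detected text box against a hand-drawn one is intersection over union, sketched here purely as an assumption. Boxes are taken to be (left, top, right, bottom) in pixels with exclusive right and bottom edges.

```python
def overlap_score(box_a, box_b):
    """Intersection over union of two axis-aligned pixel boxes.

    Each box is (left, top, right, bottom); right/bottom are exclusive.
    Returns a value between 0.0 (disjoint) and 1.0 (identical).
    """
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])

    # Width and height of the shared region, clamped at zero if disjoint.
    inter = max(0, right - left) * max(0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


detected = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(overlap_score(detected, ground_truth))
```

A detection would then count as correct when its score against a ground-truth box exceeds some chosen threshold; the threshold is a further assumption not given in the slides.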
RECOGNITION:
Using Tesseract OCR engine
Machine type: 74% word-level accuracy, 82% character-level accuracy
Handwriting: 5.4% word-level accuracy, 9.2% character-level accuracy
Current Progress in OCR recognition
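Word-level accuracy comes out lower than character-level accuracy because a single wrong character spoils a whole word. The sketch below illustrates that gap with two assumed definitions, since the slides do not state how the figures were computed: character accuracy as one minus the edit distance divided by the reference length, and word accuracy as the share of reference words matched exactly in position.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_accuracy(reference, ocr):
    """Assumed definition: 1 - edit distance / reference length."""
    if not reference:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - edit_distance(reference, ocr) / len(reference))


def word_accuracy(reference, ocr):
    """Assumed definition: share of reference words matched in position."""
    ref_words, ocr_words = reference.split(), ocr.split()
    if not ref_words:
        return 1.0 if not ocr_words else 0.0
    correct = sum(r == o for r, o in zip(ref_words, ocr_words))
    return correct / len(ref_words)


# Made-up label text with one misread character.
truth = "Alameda Co. Berkeley"
ocr_output = "Alameda Co. Berkelcy"
print(char_accuracy(truth, ocr_output), word_accuracy(truth, ocr_output))
```

In this made-up example one misread character out of twenty costs a third of the words.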
Data Manipulation
Error checking
Geographic referencing
Aggregate data in online cache
Temporospatial analyses
Just starting this phase
No report on error rates
Georeferencing is very slow, even with semi-automation using GEOLocate and other services
Following Darwin Core standards
Merging of data straightforward
Analyses pending
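Merging is straightforward under Darwin Core because each collection only has to map its own column names onto shared terms once. A minimal sketch follows. The local column names, the field map, and the sample values are invented for illustration; the Darwin Core terms themselves (institutionCode, catalogNumber, genus, specificEpithet, eventDate, stateProvince, county) are standard.

```python
# Hypothetical local column names mapped to Darwin Core terms.
LOCAL_TO_DWC = {
    "specimen_id": "catalogNumber",
    "genus": "genus",
    "species": "specificEpithet",
    "collected": "eventDate",
    "state": "stateProvince",
    "county": "county",
}


def to_darwin_core(local_record, institution_code, field_map=LOCAL_TO_DWC):
    """Rename local fields to Darwin Core terms, skipping empty values."""
    dwc = {"institutionCode": institution_code}
    for local_name, term in field_map.items():
        value = local_record.get(local_name)
        if value not in (None, ""):
            dwc[term] = value
    return dwc


def merge(*collections):
    """Combine record lists, keyed on (institutionCode, catalogNumber)."""
    merged = {}
    for records in collections:
        for rec in records:
            merged[(rec["institutionCode"], rec["catalogNumber"])] = rec
    return list(merged.values())


# Made-up sample row using the catalog number format from the slides.
row = {"specimen_id": "EMEC218958", "genus": "Paracotalpa",
       "species": "ursina", "state": "California", "county": ""}
print(merge([to_darwin_core(row, "EMEC")]))
```

Keying on institution code plus catalog number means a record submitted twice replaces its earlier copy instead of being duplicated in the aggregate cache.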
Progress
• After 2 years ...
• Undergraduate student work force
• Pinned specimens
– Imaging 20-65 specimens per hour (ave. = 40)
• Microscope slides
– Imaging 100-170 specimens per hour (ave. = 140)
• Approximately 40,000 records databased
– Plus 115,000 previously databased insect records
• 150,000+ images waiting to be databased
Thank you
http://calbug.berkeley.edu