VRA 2012, Cataloging Case Studies, ROBOCATALOGING
Joshua Polansky
University of Washington College of Built Environments
Visual Resources Collection
ROBOCATALOGING
Accelerated workflows using OCR and automation
Cataloging Case Studies, April 21, 2012
University of Washington College of Built Environments
Visual Resources Collection
Serves the departments of Architecture, Construction Management, Landscape Architecture and Urban Design & Planning
Analog collection:
• 130,000 35mm slides accessioned and cataloged since the 1950s
• Typewritten records; no digital database or online component until 2002
Visual Resources Collection
Digital components:
• MS Access database catalog
• MDID2 for faculty / students
The big question:
Automated processes exist for batch digitizing analog photos.
Is it possible to batch digitize old cataloging data, too?
Good cataloging information here, researched and typed years ago.
More good data, including source and a unique accession number.
Paper records to the rescue
• Binders and binders of accession records
• Pristine label photocopies
A closer look at the slide label
• Accession number: collection ID that appears on every label in this form
• Architect
• Building name
• Location / Year
• View
• Source
• Photocopied label edge that will interfere with OCR later
The big challenge:
• Digitize these typewritten pages
• Sort slide label text into distinct columns in Excel
• Identify each record with its accession number
• Do it all with common or affordable tools
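To make the sorting and identification steps concrete, here is a minimal Python sketch of the idea. It is not the Automator and Excel workflow used in this project. It assumes that each label's OCR text begins with its accession number and that the remaining lines follow the label order (architect, building name, location / year, view, source). The accession pattern `ACCESSION_RE` and the column names in `FIELDS` are hypothetical, since the slides do not spell out the real format.

```python
import csv
import io
import re

# Hypothetical accession-number pattern; the real format is not given in the slides.
ACCESSION_RE = re.compile(r"^\d{2}-\d{3,6}$")

# Hypothetical column names, following the order of fields on the slide label.
FIELDS = ["accession", "architect", "building", "location_year", "view", "source"]


def split_records(ocr_text):
    """Group OCR lines into records, starting a new record at each accession number."""
    records, current = [], None
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ACCESSION_RE.match(line):
            current = [line]
            records.append(current)
        elif current is not None:
            current.append(line)
        # Lines that appear before the first accession number are dropped.
    return records


def to_row(record):
    """Map a record's lines onto FIELDS; extra lines are folded into the last field."""
    head, extra = record[:len(FIELDS)], record[len(FIELDS):]
    row = dict(zip(FIELDS, head))
    if extra:
        row["source"] = " ".join([row["source"]] + extra)
    for name in FIELDS:
        row.setdefault(name, "")  # short records get empty cells
    return row


def to_csv(ocr_text):
    """Return the parsed records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in split_records(ocr_text):
        writer.writerow(to_row(record))
    return out.getvalue()
```

Keying each record on its accession number means a label with missing or extra OCR lines cannot shift the records that follow it; the damage stays inside one row, where it can be found during cleanup.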
Photo: Alvaro Farfán via Flickr. 3392225359
Hardware
Apple iMac
• 2010 model
• OS 10.6
• Any recent Mac will do (OS 10.4 or higher)
Epson Perfection V500 scanner
• With optional Automatic Document Feeder (ADF) for stacks of 30 sheets at a time
• Standard transparency unit makes it useful for other scanning projects
• Retails for less than $300 with ADF
Photo: Zak Moreira via Flickr. 3425393424
Software
Adobe Photoshop CS4
• Resize and realign scanned page into a single-column tif with Actions
Adobe Acrobat Pro
• Create a pdf of each tif
• Analyze pdf with optical character recognition (OCR) and make pdf text selectable
Apple Automator and Automator Virtual Input
• Execute workflows to control multiple applications. Launch, copy, paste, manipulate, save, repeat.
• Create Folder Actions for Finder automation
• Virtual Input: Extend the functionality of Automator for even more control over apps, mouse, keyboard
Microsoft Excel 2008
• Receive text from Acrobat in columns
• After text manipulation and sorting, output in a cross-platform format like csv
Automator
• Comes standard with Mac OS X 10.4+
• Allows scripting and workflow creation via GUI
• Can perform operations within an application or across multiple applications
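For readers without Automator, the Folder Action idea (run a workflow whenever new files land in a folder) can be approximated in a few lines of Python. This is only a sketch of the concept, not Automator's own mechanism: the function names `new_files` and `run_folder_action` are hypothetical, and `handler` stands in for whatever processing step should run on each new scan.

```python
import os


def new_files(folder, seen, extensions=(".tif", ".tiff")):
    """Return files in `folder` with a matching extension that are not yet in `seen`."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if (
            os.path.isfile(path)
            and name.lower().endswith(extensions)
            and path not in seen
        ):
            found.append(path)
    return found


def run_folder_action(folder, handler, seen=None):
    """Call `handler` once for each new file, then record that file as processed."""
    seen = set() if seen is None else seen
    for path in new_files(folder, seen):
        handler(path)
        seen.add(path)
    return seen
```

Calling `run_folder_action` repeatedly with the returned `seen` set (for example, from a timed loop) processes each scan exactly once, which is the behavior a watched folder is meant to provide.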
Document scanning: Automator, Folder Actions, Photoshop [video here in original presentation]
Text processing: Automator + Automator Virtual Input, Folder Actions, Acrobat, Excel [video here in original presentation]
Processed output in Excel
Sometimes it looks good...
Sometimes it doesn’t.
Final result after text sorting and cleanup
Goal
• Produce nearly perfect metadata, clean enough to import into existing database
Actual outcome
• Produced pretty good metadata
• Spent lots of time on data cleanup to get there
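Part of that cleanup can be scripted. The sketch below is an illustration rather than the process used in this project: it strips stray characters of the kind OCR tends to produce from photocopied label edges and lists required cells that end up empty, so those rows can be checked by hand. The character set in `EDGE_CHARS` and the field names are assumptions; both would need adjusting after a look at real OCR output.

```python
import re

# Assumed stray characters produced by label edges; adjust after inspecting real output.
EDGE_CHARS = "|_[]{}~`"


def clean_cell(text):
    """Strip assumed edge artifacts from both ends and collapse runs of whitespace."""
    text = text.strip().strip(EDGE_CHARS).strip()
    return re.sub(r"\s+", " ", text)


def needs_review(row, required=("accession", "architect", "building")):
    """Return the names of required fields that are empty after cleaning."""
    return [name for name in required if not clean_cell(row.get(name, ""))]
```

For example, `needs_review({"accession": "62-1234", "architect": "|", "building": "Fallingwater"})` returns `["architect"]`, because the only content in that cell was an edge artifact.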
Goal
• Use tools on hand; any new tools should be cheap or useful for other projects
Actual outcome
• Used standard software, plus one new application ($25)
• iMac is a student workstation
• Epson scanner is in use for print and film scanning plus pdf creation
Goal
• Have 75,000 new records ready to pair with images and publish to MDID
Actual outcome
• Got 75,000 records!
• Created a searchable shelf list and archival finding aid
• With further data cleanup, the original goal of MDID use can be achieved
Photo: JF Sebastian via Flickr. 412874324
• Every Mac comes with Automator and it is easy to learn
• You probably have OCR tools on your computer right now
• Experimenting can produce great results
Photo credits
• Software icons and screenshots by Adobe, Apple, Microsoft and Singed Labcoat
• Kraftwerk images by Flickr users Zak Moreira, Alvaro Farfán and JF Sebastian
• Other photo and video by UW CBE VRC
Thank you
Rainer Metzger, University of Washington