Post on 17-Dec-2015
From Data to Discovery
Building Automated Cataloguing Tools with Perl
Huw Jones
Cambridge University Library
Effects
• Difficulty in resource discovery
• Patchy retrieval
• Lack of authority control
• Difficulty with standard deduplication
• Burden on staff time
• Ties us to multiple database model
Existing Solutions?
• Manual recataloguing
• Commercial solutions
• Universal catalogue
• Discovery layer
These either don’t solve the core problem, or are expensive and/or time-consuming
Our solution
Automated Cataloguing Tools!
• Short record enrichment
• Automated MARC correction
• Deduplication
Order important – full, well coded records are easier to deduplicate
General principles
• Retrieve some records from a Voyager database
• Examine and/or manipulate them
• If necessary, make changes in the database
N.B. Watch indexes and table space!
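The retrieve / examine / change cycle above can be sketched in a few lines. This is only an illustration under stated assumptions: the tools described here are written in Perl with DBI against a Voyager database, whereas the sketch uses Python with an in-memory SQLite database as a stand-in, and the table `bib`, its columns, and the function `tidy_titles` are all hypothetical.

```python
import sqlite3


def tidy_titles(conn):
    """Retrieve records, examine them, and update only those that change.

    Returns the number of records updated.
    """
    changed = 0
    # Retrieve: fetch everything first so we are not updating mid-query
    rows = conn.execute("SELECT bib_id, title FROM bib").fetchall()
    for bib_id, title in rows:
        # Examine/manipulate: a trivial stand-in for real record checks
        new_title = title.strip()
        # Change: write back only when something actually differs
        if new_title != title:
            conn.execute(
                "UPDATE bib SET title = ? WHERE bib_id = ?",
                (new_title, bib_id),
            )
            changed += 1
    conn.commit()
    return changed


# Hypothetical stand-in database, not the Voyager schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bib (bib_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO bib VALUES (?, ?)",
    [(1, "Learning Perl "), (2, "Programming the Perl DBI")],
)
updated = tidy_titles(conn)
```

Writing back only the records that differ keeps the number of database changes (and so the index and table space churn) as small as possible.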
General tools
• Perl – holds everything together
• Perl DBI – connects to databases
• SQL – retrieves records from database
• MARC::Record modules (from CPAN) – to examine/manipulate records
• Pbulkimport/Batchcat – to make changes to the database
Batchcat vs Pbulkimport
• Batchcat – installed on PC with Voyager
• More versatile
• Can’t be used on server
• Pbulkimport – limited functionality
• Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)
• Can be used on server
Books
• Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320
• Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994
Basic mechanism
• Take short record
• Find a matching full record
• Overlay short record with full record
• Need a source of full records
• In Cambridge: the University Library has a large database of full, authority-controlled records
File of SHORT RECORD bib ids
Connects to LOCAL database and checks each is a valid bib id
Retrieves SHORT RECORD info from local database
Connects to EXTERNAL source. Finds best FULL RECORD match and scores it
Compares match score to overlay threshold. If OK, retrieves MARC record for FULL RECORD
Corrects FULL MARC record. Removes inappropriate fields. Inserts fields to be retained from SHORT RECORD
In local database, overlays SHORT RECORD with FULL RECORD
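The threshold test and the field merge in this workflow might look roughly like the sketch below. It is written in Python for illustration (the service itself is described as Perl), and the threshold value, the tag sets, and the function name `build_overlay` are assumptions; the source does not say which fields are removed or retained.

```python
# Assumed values for illustration only
OVERLAY_THRESHOLD = 0.8
REMOVED_TAGS = {"949"}          # hypothetical "inappropriate" fields in the full record
RETAINED_TAGS = {"001", "852"}  # hypothetical fields kept from the short record


def build_overlay(short_fields, full_fields, match_score,
                  threshold=OVERLAY_THRESHOLD):
    """Return the fields to overlay on the short record, or None.

    Records are lists of (tag, value) pairs.
    """
    if match_score < threshold:
        return None  # match too weak: leave the short record untouched
    # Remove inappropriate fields, and any the short record will supply
    merged = [(t, v) for t, v in full_fields
              if t not in REMOVED_TAGS and t not in RETAINED_TAGS]
    # Insert the fields to be retained from the short record
    merged += [(t, v) for t, v in short_fields if t in RETAINED_TAGS]
    return sorted(merged, key=lambda f: f[0])


short = [("001", "12345"), ("245", "Habitable planet"), ("852", "Local shelfmark")]
full = [("001", "662002"), ("100", "Broecker, W. S."),
        ("245", "How to build a habitable planet /"), ("949", "Other local data")]

overlay = build_overlay(short, full, match_score=0.9)
rejected = build_overlay(short, full, match_score=0.5)
```

Returning `None` for a weak match keeps the decision "do nothing" explicit, so a caller cannot overlay a record by accident.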
Results
• Service has been running for 1 year (much of which was testing)
• 18 libraries subscribed to use service
• 90,000 short records upgraded
MARC checking and correction
• Bibliographic standard – agreed minimum standard for cataloguing
• Every week, libraries receive an automatically generated file of MARC coding errors for correction
• Based on MARC::Lint module with many alterations
Mechanism
• Connects to database using Perl DBI
• Retrieves MARC records created or edited in the last week
• Runs them through MARC check
• Prints errors to file
• Emails file to library
Over 100,000 errors pointed out so far!
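The weekly check can be pictured as a filter, a set of rules, and a per-library report. The following is a minimal Python sketch, not the Perl program or the MARC::Lint rules themselves: the two checks are simplified stand-ins, the record layout and library names are invented, and the emailing step is left out.

```python
from collections import defaultdict
from datetime import date, timedelta


def check_record(fields):
    """Return error messages for one record (simplified example rules)."""
    errors = []
    titles = [v for t, v in fields if t == "245"]
    if len(titles) != 1:
        errors.append("245: Field must occur exactly once")
    for value in titles:
        if not value.endswith((".", "?", "!")):
            errors.append("245: Must end with a full stop")
    for tag, value in fields:
        if tag == "260" and not value.endswith((".", "]", ")", "-")):
            errors.append("260: Must end with a full stop")
    return errors


def weekly_reports(records, today):
    """Group errors by library for records changed in the last 7 days."""
    cutoff = today - timedelta(days=7)
    reports = defaultdict(list)
    for rec in records:
        if rec["modified"] < cutoff:
            continue  # not created/edited this week
        for message in check_record(rec["fields"]):
            reports[rec["library"]].append(f"Bib id {rec['bib_id']}: {message}")
    return dict(reports)


records = [
    {"bib_id": 662002, "library": "Library A", "modified": date(2007, 12, 5),
     "fields": [("245", "How to build a habitable planet ;By Wallace S. Broecker."),
                ("260", "New York ;Eldigio Press,c1985")]},
    {"bib_id": 100, "library": "Library B", "modified": date(2007, 1, 1),
     "fields": []},
]
reports = weekly_reports(records, today=date(2007, 12, 7))
```

Each value in `reports` is the list of lines that would be written to that library's error file.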
MARC Correction
How to get from this …
• =LDR 00472nam\\2200157\a\4500
• =001 662002
• =005 20071205064734.0
• =008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
• =020 \\$a9780961751111
• =100 1\$aBroecker, W.S.,$d1931-
• =245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
• =260 \\$aNew York ;$bEldigio Press,$cc1985
• =300 \\$a291p $bill $c23cm
• =504 \\$aIncludes index.
• =650 \0$aAstronomy.
• =650 \0$aAstrophysics.
to this!
• =LDR 00453nam 2200157 a 4500
• =001 662002
• =005 20071205064734.0
• =008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
• =020 \\$a9780961751111
• =100 1\$aBroecker, W. S.,$d1931-
• =245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
• =260 \\$aNew York :$bEldigio Press,$cc1985.
• =300 \\$a291 p. :$bill. ;$c23 cm.
• =504 \\$aIncludes index.
• =650 \0$aAstronomy.
• =650 \0$aAstrophysics.
MARC Correction
• Version of module which, where there is no ambiguity, corrects errors
• Built into short record upgrade program
• Also offered as a retrospective service to clean up legacy records
• Possibility of building it into weekly check
Mechanism
• Connects to database using Perl DBI
• Retrieves full MARC record
• Runs against correction module
• Replaces corrected record in database
Output
• Bib id: 662002
• How to build a habitable planet ; By Wallace S. Broecker.
• 100: UPDATE: Spaces inserted between initials in subfield _a
• 245: UPDATE: By uncapitalised at start of subfield c
• 245: UPDATE: Space forward slash inserted before subfield _c
• 260: UPDATE: Full stop inserted at end of field
• 260: UPDATE: Space colon inserted before subfield _b
• 300: UPDATE: Full stop inserted after the p in pagination
• 300: UPDATE: Full stop inserted at end of field
• 300: UPDATE: Illustration abbreviation has been corrected
• 300: UPDATE: Space colon inserted before subfield _b
• 300: UPDATE: Space inserted between digits and cm
• 300: UPDATE: Space inserted between digits and p in pagination
• 300: UPDATE: Space semi-colon inserted before subfield c
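Several of the updates listed in this output are simple pattern rewrites on a field's text. The sketch below reproduces them for the example record in Python using regular expressions. It is an illustration only, not the correction module described here (which is Perl, built on MARC::Lint): the function names are hypothetical, fields are plain `$a...$b...` strings without indicators, and the rules are far cruder than a production version would need.

```python
import re


def punct_before(value, code, mark):
    """Put ' <mark>' directly before subfield $<code>, replacing any old mark."""
    pattern = r"\s*[;:,/]?\s*\$" + re.escape(code)
    return re.sub(pattern, lambda m: f" {mark}${code}", value)


def end_with_full_stop(value):
    return value if value.endswith(".") else value + "."


def fix_100(value):
    # Space between adjacent initials: "W.S." -> "W. S."
    return re.sub(r"(?<=[A-Z]\.)(?=[A-Z]\.)", " ", value)


def fix_245(value):
    value = punct_before(value, "c", "/")
    return value.replace("$cBy ", "$cby ")  # uncapitalise "By"


def fix_260(value):
    return end_with_full_stop(punct_before(value, "b", ":"))


def fix_300(value):
    value = re.sub(r"(\d)\s*p\.?(?=\s|\$|$)", r"\1 p.", value)  # "291p" -> "291 p."
    value = re.sub(r"\$bill\.?(?=\s|\$|$)", "$bill.", value)    # "ill" -> "ill."
    value = re.sub(r"(\d)\s*cm\.?", r"\1 cm", value)            # "23cm" -> "23 cm"
    value = punct_before(value, "b", ":")
    value = punct_before(value, "c", ";")
    return end_with_full_stop(value)


FIXERS = {"100": fix_100, "245": fix_245, "260": fix_260, "300": fix_300}


def correct_record(fields):
    """Apply the matching fixer to each (tag, value) pair; others pass through."""
    return [(tag, FIXERS.get(tag, lambda v: v)(value)) for tag, value in fields]


before = [
    ("100", "$aBroecker, W.S.,$d1931-"),
    ("245", "$aHow to build a habitable planet ;$cBy Wallace S. Broecker."),
    ("260", "$aNew York ;$bEldigio Press,$cc1985"),
    ("300", "$a291p $bill $c23cm"),
    ("504", "$aIncludes index."),
]
after = correct_record(before)
```

Each rule is written so that running it a second time changes nothing, which matters if records may pass through the correction step more than once.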
Results
• In testing 70,000 records processed
• Corrected over 200,000 MARC coding errors
• May run ALL our existing records through at some stage
Deduplication – in progress!
Three stages:
• Identification of groups of duplicates
• Identification/construction of ‘best’ record
• Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best record’
Identification of duplicates
• Connect to a database with Perl DBI
• Use SQL to retrieve records
• For each record, retrieve all available data from tables
• Use matching algorithm to identify groups of duplicates
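The source does not describe the matching algorithm, so the sketch below swaps in one common approach purely as an illustration: build a normalised match key for each record (ISBN if present, otherwise title plus year) and group records that share a key. It is in Python rather than the Perl used by the tools, and the record layout, the hyphenated form of the ISBN, and the function names are assumptions.

```python
import re
from collections import defaultdict


def match_key(rec):
    """Build a normalised key; records with equal keys are treated as duplicates."""
    isbn = re.sub(r"[^0-9Xx]", "", rec.get("isbn", "")).upper()
    if isbn:
        return ("isbn", isbn)
    # Fall back to lower-cased title without punctuation, plus year
    title = re.sub(r"[^a-z0-9 ]", "", rec.get("title", "").lower())
    title = " ".join(title.split())
    return ("title", title, rec.get("year", ""))


def duplicate_groups(records):
    """Return lists of bib ids that share a match key (groups of 2 or more)."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec["bib_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    {"bib_id": 1, "isbn": "978-0-9617511-1-1", "title": "How to build a habitable planet"},
    {"bib_id": 2, "isbn": "9780961751111", "title": "How to build a habitable planet /"},
    {"bib_id": 3, "title": "Learning Perl", "year": "2001"},
    {"bib_id": 4, "title": "Learning Perl.", "year": "2001"},
    {"bib_id": 5, "title": "Programming the Perl DBI", "year": "2000"},
]
groups = duplicate_groups(records)
```

Exact-key matching like this is deliberately conservative; it misses near-duplicates, which is one reason full, well-coded records are easier to deduplicate.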
Identification of best record
• For each group of duplicates, the MARC records are retrieved
• Passed to scoring algorithm
• Record with highest score forms basis of ‘best’ record
• Retains set fields (e.g. subject headings) from ‘other’ records
• Corrects any MARC coding errors
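A rough picture of how a 'best' record might be chosen and built is sketched below in Python (the tools themselves are Perl). The scoring rule, the choice of 650 as the retained tag, and the function names are all assumptions made for the example; the actual scoring algorithm is not given in the source, and the MARC correction step is omitted.

```python
RETAINED_TAGS = {"650"}  # assumed: subject headings are kept from other records


def score(fields):
    """Hypothetical score: one point per field, two extra per subject heading."""
    return len(fields) + 2 * sum(1 for tag, _ in fields if tag == "650")


def build_best(group):
    """group maps bib id -> list of (tag, value). Returns (best_id, fields)."""
    best_id = max(group, key=lambda bib_id: score(group[bib_id]))
    best = list(group[best_id])
    for bib_id, fields in group.items():
        if bib_id == best_id:
            continue
        for field in fields:
            # Carry over retained fields the best record does not already have
            if field[0] in RETAINED_TAGS and field not in best:
                best.append(field)
    best.sort(key=lambda f: f[0])  # stable sort keeps order within a tag
    return best_id, best


group = {
    1: [("245", "How to build a habitable planet /"), ("650", "Astronomy.")],
    2: [("100", "Broecker, W. S."), ("245", "How to build a habitable planet /"),
        ("260", "New York :"), ("650", "Astrophysics.")],
}
best_id, best = build_best(group)
```

Here record 2 scores higher and forms the basis, and the subject heading found only in record 1 is added to it.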
But …
• No relinking functionality, even in BatchCat
• No viable workaround for libraries using Acquisitions, or that avoids losing circulation history
In conclusion …
• Tools for librarians, not replacements!
• Do the stuff programs do well, allowing humans to concentrate on what humans do well
• Won’t do all the work, just makes a solution to major data problems feasible