From Data to Discovery

From Data to Discovery Building Automated Cataloguing Tools with Perl Huw Jones Cambridge University Library


Page 1: From Data to Discovery

From Data to Discovery

Building Automated Cataloguing Tools with Perl

Huw Jones

Cambridge University Library

Page 2: From Data to Discovery
Page 3: From Data to Discovery

Small city, big University = lots of libraries!

Cambridge

Page 4: From Data to Discovery
Page 5: From Data to Discovery
Page 6: From Data to Discovery
Page 7: From Data to Discovery
Page 8: From Data to Discovery

Lots of libraries = lots of books

Page 9: From Data to Discovery

Bibliographic records

University Library: 3.85 M

Other libraries: 2.5 M

8 databases

Page 10: From Data to Discovery

Data problems

Quality

Duplication

Page 11: From Data to Discovery

Quality - fullness

of 2.5 M records in our databases

1 M are short records

Page 12: From Data to Discovery

Quality – coding

Page 13: From Data to Discovery

Duplication

Page 14: From Data to Discovery

Effects

• Difficulty in resource discovery

• Patchy retrieval

• Lack of authority control

• Difficulty with standard deduplication

• Burden on staff time

• Ties us to multiple database model

Page 15: From Data to Discovery

Aims

Better records

Fewer records

Page 16: From Data to Discovery

Existing Solutions?

• Manual recataloguing

• Commercial solutions

• Universal catalogue

• Discovery layer

Either don’t solve the core problem, or expensive and/or time consuming

Page 17: From Data to Discovery

Our solution

Automated Cataloguing Tools!

• Short record enrichment

• Automated MARC correction

• Deduplication

Order important – full, well coded records are easier to deduplicate

Page 18: From Data to Discovery

General principles

• Retrieve some records from a Voyager database

• Examine and/or manipulate them

• If necessary, make changes in the database

N.B. Watch indexes and table space!

Page 19: From Data to Discovery

General tools

• Perl – holds everything together

• Perl DBI – connects to databases

• SQL – retrieves records from database

• MARC::Record modules (from CPAN) – to examine/manipulate records

• Pbulkimport/Batchcat – to make changes to the database

Page 20: From Data to Discovery

Batchcat vs Pbulkimport

• Batchcat – installed on PC with Voyager

• More versatile

• Can’t be used on server

• Pbulkimport – limited functionality

• Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)

• Can be used on server

Page 21: From Data to Discovery

Books

• Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320

• Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994

Page 22: From Data to Discovery

Enriching short records

How to get from this …

Page 23: From Data to Discovery

to this

Page 24: From Data to Discovery

Basic mechanism

• Take short record

• Find a matching full record

• Overlay short record with full record

• Need a source of full records

• In Cambridge - University Library - large database of full, authority controlled records

Page 25: From Data to Discovery

• File of SHORT RECORD bib ids

• Connects to LOCAL database and checks if a valid bib id

• Retrieves SHORT RECORD info from local database

• Connects to EXTERNAL source. Finds best FULL RECORD match and scores it

• Compares match score to overlay threshold. If OK, retrieves MARC record for FULL RECORD

• Corrects FULL MARC record. Removes inappropriate fields. Inserts fields to be retained from SHORT RECORD

• In local database overlays SHORT RECORD with FULL RECORD
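The slides do not say how the match score is calculated or what the overlay threshold is. As a rough illustration only, the scoring and threshold step might look like the Python sketch below (the service itself is written in Perl). The weighting, the threshold value and the names `score_match`, `best_match` and `choose_overlay` are all assumptions, and the second candidate record is invented.

```python
from difflib import SequenceMatcher

# Hypothetical threshold: the slides mention one but do not give its value.
OVERLAY_THRESHOLD = 0.85


def normalise(text):
    """Lower-case the text and keep only letters, digits and single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def score_match(short, full):
    """Score a candidate full record against a short record (0.0 to 1.0).

    Assumed weighting: 70% title similarity, 20% ISBN match, 10% year match.
    """
    title_sim = SequenceMatcher(
        None, normalise(short["title"]), normalise(full["title"])
    ).ratio()
    isbn_hit = 1.0 if short.get("isbn") and short.get("isbn") == full.get("isbn") else 0.0
    year_hit = 1.0 if short.get("year") and short.get("year") == full.get("year") else 0.0
    return 0.7 * title_sim + 0.2 * isbn_hit + 0.1 * year_hit


def best_match(short, candidates):
    """Return (score, record) for the highest-scoring candidate."""
    best = (0.0, None)
    for cand in candidates:
        score = score_match(short, cand)
        if score > best[0]:
            best = (score, cand)
    return best


def choose_overlay(short, candidates, threshold=OVERLAY_THRESHOLD):
    """Return the full record to overlay with, or None if no match is good enough."""
    score, record = best_match(short, candidates)
    return record if record is not None and score >= threshold else None


# Illustrative data: the first candidate mirrors the example record in the slides.
short = {"bib_id": 662002, "title": "How to build a habitable planet",
         "isbn": "9780961751111", "year": "1985"}
candidates = [
    {"title": "How to build a habitable planet", "isbn": "9780961751111", "year": "1985"},
    {"title": "How to build a boat", "isbn": None, "year": "1990"},
]
overlay = choose_overlay(short, candidates)
```

Keeping the threshold as a parameter means a cautious library could demand a higher score before allowing an overlay.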

Page 26: From Data to Discovery

Output

Page 27: From Data to Discovery

Interface

Page 28: From Data to Discovery

Results

• Service has been running for 1 year (much of which was testing)

• 18 libraries subscribed to use service

• 90,000 short records upgraded

Page 29: From Data to Discovery

MARC checking and correction

• Bibliographic standard – agreed minimum standard for cataloguing

• Every week, libraries receive an automatically generated file of MARC coding errors for correction

• Based on MARC::Lint module with many alterations

Page 30: From Data to Discovery

Output

Page 31: From Data to Discovery

Mechanism

• Connects to database using Perl DBI

• Retrieves MARC record for records created/edited in last week

• Runs them through MARC check

• Prints errors to file

• Emails file to library

Over 100,000 errors pointed out so far!
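The weekly check can be pictured with the small Python sketch below (the real tool uses Perl DBI and a modified MARC::Lint). It uses an in-memory SQLite table as a stand-in for the Voyager database. The table and column names, the two 245 checks and the second and third sample rows are hypothetical, and the emailing step is left out.

```python
import sqlite3
from datetime import date, timedelta

# In-memory stand-in for the catalogue database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bib (bib_id INTEGER PRIMARY KEY, updated TEXT, field_245 TEXT)")
conn.executemany(
    "INSERT INTO bib VALUES (?, ?, ?)",
    [
        (662002, "2007-12-05", "10$aHow to build a habitable planet ;$cBy Wallace S. Broecker."),
        (662003, "2007-12-04", "10$aA correctly punctuated title /$cby A. N. Other."),
        (100001, "2007-10-01", "10$aAn older record without a full stop"),
    ],
)


def recent_records(conn, today, days=7):
    """Fetch records created or edited in the last `days` days."""
    cutoff = (today - timedelta(days=days)).isoformat()
    return conn.execute(
        "SELECT bib_id, field_245 FROM bib WHERE updated >= ? ORDER BY bib_id",
        (cutoff,),
    ).fetchall()


def check_245(field):
    """Two illustrative punctuation checks on a 245 field."""
    errors = []
    if "$c" in field and not field.split("$c")[0].endswith(" /"):
        errors.append("245: subfield _c must be preceded by space forward slash")
    if not field.endswith("."):
        errors.append("245: field must end with a full stop")
    return errors


def weekly_report(conn, today):
    """Build the error file contents: one line per error, keyed by bib id."""
    lines = []
    for bib_id, field in recent_records(conn, today):
        for err in check_245(field):
            lines.append(f"{bib_id}\t{err}")
    return "\n".join(lines)


report = weekly_report(conn, date(2007, 12, 7))
```

ISO-formatted dates compare correctly as text, which is why the cutoff can be passed as a plain string in this sketch.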

Page 32: From Data to Discovery

MARC Correction

How to get from this …

• =LDR 00472nam\\2200157\a\4500
• =001 662002
• =005 20071205064734.0
• =008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
• =020 \\$a9780961751111
• =100 1\$aBroecker, W.S.,$d1931-
• =245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
• =260 \\$aNew York ;$bEldigio Press,$cc1985
• =300 \\$a291p $bill $c23cm
• =504 \\$aIncludes index.
• =650 \0$aAstronomy.
• =650 \0$aAstrophysics.

Page 33: From Data to Discovery

to this!

• =LDR 00453nam 2200157 a 4500
• =001 662002
• =005 20071205064734.0
• =008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
• =020 \\$a9780961751111
• =100 1\$aBroecker, W. S.,$d1931-
• =245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
• =260 \\$aNew York :$bEldigio Press,$cc1985.
• =300 \\$a291 p. :$bill. ;$c23 cm.
• =504 \\$aIncludes index.
• =650 \0$aAstronomy.
• =650 \0$aAstrophysics.

Page 34: From Data to Discovery

MARC Correction

• Version of module which, where there is no ambiguity, corrects errors

• Built into short record upgrade program

• Also offered as a retrospective service to clean up legacy records

• Possibility of building it into weekly check

Page 35: From Data to Discovery

Mechanism

• Connects to database using Perl DBI

• Retrieves full MARC record

• Runs against correction module

• Replaces corrected record in database

Page 36: From Data to Discovery

Output

• Bib id: 662002
• How to build a habitable planet ; By Wallace S. Broecker.
• 100: UPDATE: Spaces inserted between initials in subfield _a
• 245: UPDATE: By uncapitalised at start of subfield c
• 245: UPDATE: Space forward slash inserted before subfield _c
• 260: UPDATE: Full stop inserted at end of field
• 260: UPDATE: Space colon inserted before subfield _b
• 300: UPDATE: Full stop inserted after the p in pagination
• 300: UPDATE: Full stop inserted at end of field
• 300: UPDATE: Illustration abbreviation has been corrected
• 300: UPDATE: Space colon inserted before subfield _b
• 300: UPDATE: Space inserted between digits and cm
• 300: UPDATE: Space inserted between digits and p in pagination
• 300: UPDATE: Space semi-colon inserted before subfield c
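To show how unambiguous punctuation fixes of this kind can be expressed as rules, here is a minimal Python sketch (the actual correction module is Perl, derived from MARC::Lint). It treats each field as a string in the mnemonic form shown on the slides and reproduces the 100, 245, 260 and 300 corrections for the example record. The regular expressions, the helper `punctuate_before` and the grouping of several messages into one rule are assumptions, not the module's own logic.

```python
import re


def punctuate_before(field, code, mark):
    """Force ' <mark>' immediately before subfield $<code>."""
    return re.sub(r"\s*[;:,/]?\s*\$" + code, " " + mark + "$" + code, field)


# Each rule is (log message, function returning the corrected field string).
RULES = {
    "100": [
        ("Spaces inserted between initials in subfield _a",
         lambda f: re.sub(r"(?<=[A-Z]\.)(?=[A-Z]\.)", " ", f)),
    ],
    "245": [
        ("Space forward slash inserted before subfield _c",
         lambda f: punctuate_before(f, "c", "/")),
        ("By uncapitalised at start of subfield _c",
         lambda f: f.replace("$cBy ", "$cby ")),
    ],
    "260": [
        ("Space colon inserted before subfield _b",
         lambda f: punctuate_before(f, "b", ":")),
        ("Full stop inserted at end of field",
         lambda f: f if f.endswith(".") else f + "."),
    ],
    "300": [
        ("Space and full stop fixed around p in pagination",
         lambda f: re.sub(r"(\d)\s*p\b\.?", r"\1 p.", f)),
        ("Space colon inserted before subfield _b",
         lambda f: punctuate_before(f, "b", ":")),
        ("Illustration abbreviation has been corrected",
         lambda f: re.sub(r"\$bill\.?(?=\s|\$|$)", "$bill.", f)),
        ("Space semi-colon inserted before subfield _c",
         lambda f: punctuate_before(f, "c", ";")),
        ("Space and full stop fixed around cm",
         lambda f: re.sub(r"(\d)\s*cm\b\.?", r"\1 cm.", f)),
    ],
}


def correct_record(fields):
    """Apply every rule for each tag; return (corrected fields, update log)."""
    corrected, log = {}, []
    for tag, value in fields.items():
        for message, rule in RULES.get(tag, []):
            new_value = rule(value)
            if new_value != value:
                log.append(f"{tag}: UPDATE: {message}")
                value = new_value
        corrected[tag] = value
    return corrected, log


# Fields from the "before" and "after" slides (raw strings keep the backslashes).
before = {
    "100": r"1\$aBroecker, W.S.,$d1931-",
    "245": r"10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.",
    "260": r"\\$aNew York ;$bEldigio Press,$cc1985",
    "300": r"\\$a291p $bill $c23cm",
}
after = {
    "100": r"1\$aBroecker, W. S.,$d1931-",
    "245": r"10$aHow to build a habitable planet /$cby Wallace S. Broecker.",
    "260": r"\\$aNew York :$bEldigio Press,$cc1985.",
    "300": r"\\$a291 p. :$bill. ;$c23 cm.",
}
corrected, log = correct_record(before)
```

Each rule leaves an already correct field unchanged, so running the sketch over a clean record produces an empty log.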

Page 37: From Data to Discovery

Results

• In testing 70,000 records processed

• Corrected over 200,000 MARC coding errors

• May run ALL our existing records through at some stage

Page 38: From Data to Discovery

Deduplication – in progress!

Three stages:

• Identification of groups of duplicates

• Identification/construction of ‘best’ record

• Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best record’

Page 39: From Data to Discovery

Identification of duplicates

• Connect to a database with Perl DBI

• Use SQL to retrieve records

• For each record, retrieve all available data from tables

• Use matching algorithm to identify groups of duplicates
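The matching algorithm itself is not described in the slides. One simple possibility, sketched in Python purely as an assumption, is to group records that share a normalised match key. The key used here (title plus year), the function names and all bib ids other than 662002 are hypothetical.

```python
from collections import defaultdict


def match_key(record):
    """Assumed match key: title reduced to lower-case letters and digits, plus year."""
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (title, record.get("year", ""))


def duplicate_groups(records):
    """Return lists of bib ids that share a match key (groups of two or more)."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec["bib_id"])
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Illustrative records: differences in case and punctuation should not matter,
# but a different year is treated as a different edition.
records = [
    {"bib_id": 662002, "title": "How to build a habitable planet", "year": "1985"},
    {"bib_id": 701455, "title": "How to build a habitable planet.", "year": "1985"},
    {"bib_id": 812300, "title": "HOW TO BUILD A HABITABLE PLANET", "year": "1985"},
    {"bib_id": 900001, "title": "How to build a habitable planet", "year": "2012"},
]
groups = duplicate_groups(records)
```

An exact key like this only catches duplicates whose differences are removed by normalisation, which is one reason full, well coded records are easier to deduplicate.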

Page 40: From Data to Discovery

And you’ll end up with something like this:

Page 41: From Data to Discovery

Identification of best record

• For each group of duplicates, MARC records are retrieved

• Passed to scoring algorithm

• Record with highest score forms basis of ‘best’ record

• Retains set fields (e.g. subject headings) from ‘other’ records

• Corrects any MARC coding errors
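The scoring algorithm is not given in the slides either. The Python sketch below shows one way the steps above could fit together, under assumed rules: one point per field, two extra points per subject heading, and 650 fields carried over from the other records. The names `score_record`, `build_best_record` and `RETAINED_TAGS`, and the second record, are hypothetical. The MARC correction step is omitted.

```python
# Assumed set of tags to carry over from the 'other' records.
RETAINED_TAGS = {"650"}


def score_record(record):
    """Assumed scoring: one point per field, two extra per 6XX subject field."""
    fields = record["fields"]
    return len(fields) + 2 * sum(1 for tag, _ in fields if tag.startswith("6"))


def build_best_record(group):
    """Use the highest-scoring record as the basis and merge in retained fields."""
    best = max(group, key=score_record)
    merged = list(best["fields"])
    for other in group:
        if other is best:
            continue
        for field in other["fields"]:
            if field[0] in RETAINED_TAGS and field not in merged:
                merged.append(field)
    merged.sort(key=lambda f: f[0])  # stable sort keeps order within each tag
    return {"bib_id": best["bib_id"], "fields": merged}


# Illustrative group of two duplicates (raw strings keep the backslashes).
fuller = {"bib_id": 662002, "fields": [
    ("245", r"10$aHow to build a habitable planet /$cby Wallace S. Broecker."),
    ("260", r"\\$aNew York :$bEldigio Press,$cc1985."),
    ("300", r"\\$a291 p. :$bill. ;$c23 cm."),
    ("650", r"\0$aAstronomy."),
]}
shorter = {"bib_id": 701455, "fields": [
    ("245", r"10$aHow to build a habitable planet."),
    ("650", r"\0$aAstrophysics."),
]}
best = build_best_record([shorter, fuller])
```

The identical-field check avoids copying a subject heading that the basis record already has.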

Page 42: From Data to Discovery
Page 43: From Data to Discovery
Page 44: From Data to Discovery
Page 45: From Data to Discovery
Page 46: From Data to Discovery

But …

• No relinking functionality, even in BatchCat

• No viable workaround for libraries using Acquisitions, or without losing circulation history

Page 47: From Data to Discovery

In conclusion …

• Tools for librarians, not replacements!

• Do the stuff programs do well, allowing humans to concentrate on what humans do well

• Won’t do all the work, just makes a solution to major data problems feasible

Page 48: From Data to Discovery

Questions?