Post on 27-Jun-2015
Slinging Data: Data Loading and Cleanup in Evergreen
Growing Evergreen Conference
22 April 2010
To migrate data …
Extract from the old, map and load into the new, clean up along the way, and keep
the auditor happy.
Whence
Extract data in a convenient form:
• Sometimes that means whatever you can get
• But better is
• MARC
• Flat text
• XML
All over the map
• Map entities
• Map fields
• Map values
• Map policies
All over the map
• Entities
• What is an item?
• What is a patron?
• Fields
• Where does the patron PIN come from?
All over the map
• Values
• Legacy item types
• 0
• 1
• 45
• 123
• 234
Quick: which is the one for journal loan?
All over the map
Legacy Item Type    Circ Modifier
0                   Regular
1                   Media
45                  AV
123                 Reference
234                 Reference
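A value map like the one above can be kept as a simple lookup, with unmapped codes collected for review rather than silently dropped. The following is a minimal Python sketch, not part of the original slides; the function name `map_item_types` and the sample code `999` are hypothetical.

```python
# Mapping taken from the table above: legacy item type -> circ modifier.
ITEM_TYPE_TO_CIRC_MODIFIER = {
    "0": "Regular",
    "1": "Media",
    "45": "AV",
    "123": "Reference",
    "234": "Reference",
}


def map_item_types(legacy_types):
    """Return (mapped, unmapped).

    mapped   -- list of (legacy code, circ modifier) pairs, in input order
    unmapped -- set of legacy codes that have no entry in the map
    """
    mapped = []
    unmapped = set()
    for code in legacy_types:
        key = str(code).strip()  # tolerate ints and stray whitespace
        if key in ITEM_TYPE_TO_CIRC_MODIFIER:
            mapped.append((key, ITEM_TYPE_TO_CIRC_MODIFIER[key]))
        else:
            unmapped.add(key)
    return mapped, unmapped


# "999" is a made-up code used only to show the unmapped path.
mapped, unmapped = map_item_types(["0", " 45", 234, "999"])
```

Reporting the unmapped set back to the library before the load is usually cheaper than discovering missing circ modifiers afterwards.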
Cleaning up
What?
• Bad data
• Ancient data
• Data it is too expensive to deal with later
When?
• Extract
• Load
• Post-load
Don’t box me in!
• The case of the dreaded double-encoding
• The even more dreadful case of the duplicitous and multiplicitous character encoding
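One common form of double-encoding is UTF-8 bytes that were mis-read as a single-byte encoding and then encoded as UTF-8 again. The sketch below, which is not from the original slides, assumes the wrong codec was Windows-1252; the function name `fix_double_encoding` is hypothetical. Because it leaves text alone when the round trip fails, it can be applied record by record to a file with mixed encodings.

```python
def fix_double_encoding(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo one round of UTF-8 bytes that were mis-decoded as `wrong_codec`.

    If the text cannot be re-encoded with `wrong_codec`, or the resulting
    bytes are not valid UTF-8, the text is assumed to be fine and is
    returned unchanged.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a double-encoded sample: the UTF-8 bytes of "Zoë" read as cp1252.
garbled = "Zoë".encode("utf-8").decode("cp1252")
repaired = fix_double_encoding(garbled)
untouched = fix_double_encoding("café")  # already correct, left alone
```

This is a heuristic: a string that legitimately contains a sequence such as "Ã«" would be altered too, so counting and spot-checking the changed records is advisable.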
Yes, those fixed fields really matter
The purpose of every modern ILS and discovery layer …
Yes, those fixed fields really matter
… is to point out every fixed field coding error in a form convenient for catalogers to identify and fix.
Fixed fields
Oops!
create or replace function m_foo.set_leader (TEXT, INT, TEXT) RETURNS TEXT AS $$
    my ($marcxml, $pos, $value) = @_;

    use MARC::Record;
    use MARC::File::XML;

    # Fall back to the original MARCXML if anything below fails.
    my $xml = $marcxml;
    eval {
        my $marc = MARC::Record->new_from_xml($marcxml, 'UTF-8');
        my $leader = $marc->leader();
        substr($leader, $pos, 1) = $value;   # overwrite one leader position
        $marc->leader($leader);
        $xml = $marc->as_xml_record;
        $xml =~ s/^<\?.+?\?>$//mo;           # strip the XML declaration
        $xml =~ s/\n//sgo;                   # remove newlines
        $xml =~ s/>\s+</></sgo;              # remove whitespace between tags
    };
    return $xml;
$$ LANGUAGE PLPERLU STABLE;
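For readers working outside the database, the same leader fix can be sketched in Python using only the standard library. This is an illustrative equivalent, not the Equinox implementation; it assumes the record is MARCXML in the standard `http://www.loc.gov/MARC21/slim` namespace, and unlike the PL/Perl version it rejects replacement values that are not exactly one character.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARCXML with a default namespace instead of an ns0: prefix.
ET.register_namespace("", MARC_NS)


def set_leader(marcxml: str, pos: int, value: str) -> str:
    """Return marcxml with leader position `pos` (0-based) set to `value`.

    Like the PL/Perl function, the input is returned unchanged if the
    record cannot be parsed or has no usable leader.
    """
    if len(value) != 1:
        raise ValueError("value must be a single character")
    try:
        root = ET.fromstring(marcxml)
    except ET.ParseError:
        return marcxml
    leader = root.find(f"{{{MARC_NS}}}leader")
    if leader is None or leader.text is None:
        return marcxml
    if not 0 <= pos < len(leader.text):
        return marcxml
    leader.text = leader.text[:pos] + value + leader.text[pos + 1:]
    return ET.tostring(root, encoding="unicode")


record = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    "<leader>00000nam a2200000 a 4500</leader>"
    "</record>"
)
# Leader position 6 is the type of record; here it is changed from "a" to "g".
fixed = set_leader(record, 6, "g")
```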
On stage
Postgres lets us create an elegant mechanism for staging data to be loaded into an Evergreen database:
• Table inheritance
• Sequences
On stage
We want to be able to
• Load and manipulate the data
• … using every tool on our belt
• … while ensuring that it doesn’t show up in production until it’s ready (and we’re ready)
On stage
• Make a separate schema
psql> create schema m_foo;
• Mirror a real table
create table m_foo.asset_copy …
On stage
• Use the sequence
…id bigint not null default nextval('asset.copy_id_seq'::regclass)…
On stage
• Make space for the legacy
create table m_foo.asset_copy_legacy (
    l_call_number TEXT
) inherits (m_foo.asset_copy);
On stage
• Munge
• Munge
• Munge some more, then …
• Insert into production:
insert into asset.copy
select * from m_foo.asset_copy;
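The stage, munge, then promote flow can be tried end to end without a Postgres server. The sketch below is not from the original slides and swaps in SQLite via Python's standard library: an attached in-memory database stands in for the `m_foo` schema, and a single staging table with an extra `l_` column stands in for table inheritance, which SQLite does not have. The column names and sample rows are hypothetical and much simpler than the real `asset.copy`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for `create schema m_foo`: SQLite has no schemas, so attach a
# second in-memory database under that name.
conn.execute("ATTACH DATABASE ':memory:' AS m_foo")

# "Production" table (illustrative columns only).
conn.execute(
    "CREATE TABLE main.copy ("
    " id INTEGER PRIMARY KEY, barcode TEXT NOT NULL, circ_modifier TEXT)"
)
# Staging table: the production columns plus a legacy column.
conn.execute(
    "CREATE TABLE m_foo.asset_copy_legacy ("
    " id INTEGER PRIMARY KEY, barcode TEXT NOT NULL, circ_modifier TEXT,"
    " l_item_type TEXT)"
)

# Load the raw legacy data into staging only.
conn.executemany(
    "INSERT INTO m_foo.asset_copy_legacy (barcode, l_item_type) VALUES (?, ?)",
    [("B001", "0"), ("B002", "45")],
)

# Munge: derive the production value from the legacy value.
conn.execute(
    "UPDATE m_foo.asset_copy_legacy SET circ_modifier ="
    " CASE l_item_type WHEN '0' THEN 'Regular' WHEN '45' THEN 'AV' END"
)

# Nothing has reached production yet.
production_before = conn.execute("SELECT count(*) FROM main.copy").fetchone()[0]

# Promote: copy only the production columns, in one transaction.
with conn:
    conn.execute(
        "INSERT INTO main.copy (id, barcode, circ_modifier)"
        " SELECT id, barcode, circ_modifier FROM m_foo.asset_copy_legacy"
    )

production_rows = conn.execute(
    "SELECT barcode, circ_modifier FROM main.copy ORDER BY id"
).fetchall()
```

In the Postgres version described in the slides, the staged `id` values come from the production sequence, so promoted rows cannot collide with existing ones; this SQLite stand-in does not reproduce that guarantee.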
Counting
Who is the auditor?
It is you … and your patrons … and maybe even an actual auditor.
Counting
• Count what matters
• Number of records
• Number of dollars
• Number of things you’ll have to fix manually
• Don’t count what doesn’t matter
• Header rows
• Junk
Counting
• Count early and often
• Conservation of library data is Newton’s 42nd law!
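The conservation rule can be checked mechanically: every extracted record should end up either loaded or explicitly rejected. A minimal Python sketch follows; it is not from the original slides, and the function names, the sample extract, and the loaded and rejected counts are all hypothetical.

```python
import csv
import io


def count_rows(text: str, has_header: bool = True) -> int:
    """Count data rows in CSV text, ignoring blank lines and the header row."""
    rows = [
        row
        for row in csv.reader(io.StringIO(text))
        if any(cell.strip() for cell in row)
    ]
    if has_header and rows:
        return len(rows) - 1
    return len(rows)


def unaccounted(extracted: int, loaded: int, rejected: int) -> int:
    """Records that were extracted but neither loaded nor rejected."""
    return extracted - loaded - rejected


# A made-up extract: one header row, one blank line, three data rows.
extract = "barcode,item_type\nB001,0\nB002,45\n\nB003,999\n"
extracted = count_rows(extract)
missing = unaccounted(extracted, loaded=2, rejected=1)
```

A nonzero result at any stage is the cue to stop and find the lost (or duplicated) records before moving on.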
Tools
• The usual suspects
• MARC::Record (or pymarc, or ruby-marc, or …)
• MarcEdit
• yaz-marcdump
• Spreadsheets
And now something new
Equinox Migration Tools
What?
MARC processing
Non-MARC processing
And more …
Where?
git://git.esilibrary.com/git/migration-tools.git
Thanks!
Galen Charlton
VP for Data Services, Equinox Software Inc.
gmc@esilibrary.com