10% Wrong 90% Wrong


Description

2011 Evergreen International Conference presentation on our MARC de-duplication project.

Transcript of 10% Wrong 90% Wrong

Page 1: 10% Wrong 90% Wrong

10% Wrong, 90% Done

Rogan Hamby, South Carolina State Library, rhamby@statelibrary.sc.gov

Shasta Brewer, York County Library, Shasta.brewer@yclibrary.net

A practical approach to bibliographic de-duplication.

Page 2: 10% Wrong 90% Wrong

Made Up Words

When I say ‘deduping’ I mean ‘MARC record de-duplication.’

Page 3: 10% Wrong 90% Wrong

The Melting Pot

We were ten library systems with no standard source of MARC records.

We came from five ILSes.

Each had its own needs and workflow.

The MARC records reflected that.

Page 4: 10% Wrong 90% Wrong

Over 2,000,000 Records

Ten library systems joined in three waves.

[Chart: bib record counts for Wave 1, Wave 2, and Wave 3, on a scale of 0 to 2,500,000.]

Page 5: 10% Wrong 90% Wrong

Early Effort

During each wave we ran a deduping script.

The script functioned as designed; however, its matches were too few for our needs.

Page 6: 10% Wrong 90% Wrong

100% Accurate

It had a very high standard for creating matches.

No bad merges were created.

Page 7: 10% Wrong 90% Wrong

Service Issue

When a patron searched the catalog, the results were messy.

Page 8: 10% Wrong 90% Wrong

This caused problems with searching and placing holds.

Page 9: 10% Wrong 90% Wrong

It’s All About the TCNs

Why was this happening?

Because identical items were divided among multiple similar bib records with distinct fingerprints, since the records came from multiple sources.

Page 10: 10% Wrong 90% Wrong

Time for the Cleaning Gloves

In March 2009 we began discussing the issue with ESI. The low merge rate was due to the very precise and conservative fingerprinting of the deduping process. In true open source spirit, we decided to roll our own solution and start cleaning up the database.

Page 11: 10% Wrong 90% Wrong

Fingerprinting

Fingerprinting is identifying a unique MARC record by its properties.
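
To make the idea concrete, here is a purely illustrative Python sketch of a fingerprint built by joining a few record properties into a single key. This is not the actual Evergreen fingerprinting routine, which is far more precise and conservative; the field choices below are assumptions for the example only.

    # Illustrative only: build a fingerprint key from a few record properties.
    # The real Evergreen routine is much more precise; these fields are assumed
    # for the sake of the example.
    def fingerprint(record: dict) -> str:
        parts = [
            record.get("title", "").lower().strip(),
            record.get("author", "").lower().strip(),
            record.get("pub_year", ""),
        ]
        return "|".join(parts)

    print(fingerprint({"title": "The Hobbit ",
                       "author": "Tolkien, J. R. R.",
                       "pub_year": "1997"}))
    # -> "the hobbit|tolkien, j. r. r.|1997"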

Page 12: 10% Wrong 90% Wrong

Because fingerprinting identifies unique records, it was of limited use, since our records came from many sources.

Page 13: 10% Wrong 90% Wrong

A Disclaimer

The initial deduping, as designed, was very accurate. It emphasized avoiding imprecise matches.

We decided that we had different priorities and were willing to make compromises.

Page 14: 10% Wrong 90% Wrong

MARC Crimes Unit

We decided to go past fingerprinting and build profiles based on broad MARC attributes.

Page 15: 10% Wrong 90% Wrong

Project Goals

Improve Searching

Faster Holds Filling

Page 16: 10% Wrong 90% Wrong

The Team

Shasta Brewer – York County

Lynn Floyd – Anderson County

Rogan Hamby – Florence County / State Library

Page 17: 10% Wrong 90% Wrong

The Mess

2,048,936 bib records

Page 18: 10% Wrong 90% Wrong

On Changes

During the development process, a lot changed between early discussions and implementation.

We weighed decisions heavily on the side of needing to have a significant and practical impact on the catalog.

I watch the ripples change their size / But never leave the stream - David Bowie, Changes

Page 19: 10% Wrong 90% Wrong

Modeling the Data

Choosing match points determines the scope of the record set you create merges from.

Due to the lack of uniformity in the records, matching became extremely important: adding a single extra limiting match point caused large percentage drops in possible matches, reducing the effectiveness of the project.

Page 20: 10% Wrong 90% Wrong

Tilting at Windmills

We refused to believe that the highest priority for deduping should be avoiding bad matches. The highest priority is creating the maximum positive impact on the catalog.

Many said we were a bit mad. Fortunately, we took it as a compliment.

Page 21: 10% Wrong 90% Wrong

We ran extensive reports to model the bib data.

A risky and unconventional model was proposed.

Although we kept trying other models, the benefit of the large number of matches from the risky model made it too compelling to discard.

Page 22: 10% Wrong 90% Wrong

Why not just title and ISBN?

We did socialize this idea. And everyone did think we were nuts.

Page 23: 10% Wrong 90% Wrong

Method to the Madness

Title and ISBN are the most commonly populated fields for identifying unique items.

Records with ISBNs and titles accounted for over 60% of the bib records in the system. The remainder included SUDOCs, ISSNs, pre-ISBN items, and some that were just plain garbage.
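
As a rough illustration of the approach, here is a minimal Python sketch that groups bib records on a normalized title plus ISBN pair. The production work was done in SQL against the Evergreen database; the sample rows and names below are invented for the example.

    from collections import defaultdict

    # Hypothetical (bib_id, normalized_title, normalized_isbn) rows, standing in
    # for a staging copy of the bib table.
    records = [
        (1, "the hobbit", "9780345339683"),
        (2, "the hobbit", "9780345339683"),
        (3, "dune", "9780441172719"),
    ]

    groups = defaultdict(list)
    for bib_id, title, isbn in records:
        if isbn:                      # records without a usable ISBN are left behind
            groups[(title, isbn)].append(bib_id)

    # Only groups with more than one record are candidate merges.
    candidates = {key: ids for key, ids in groups.items() if len(ids) > 1}
    print(candidates)   # {('the hobbit', '9780345339683'): [1, 2]}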

Page 24: 10% Wrong 90% Wrong

Geronimo

We decided to do it!

Page 25: 10% Wrong 90% Wrong

What Was Left Behind

Records without a valid ISBN.

Records without any ISBN (serials, etc.).

Pre-cat and stub records, etc.

Pure junk records.

And other things that would require such extraordinarily convoluted matching that the risk exceeded even our pain threshold for a first run.

Page 26: 10% Wrong 90% Wrong

Based on modeling, we estimated a conservative ~300,000 merges, or about 25% of our ISBN-based records.

Page 27: 10% Wrong 90% Wrong

The Wisdom of Crowds

Conventional wisdom said that MARC could not be generalized because of unique information in the records.

We were taking risks and were very aware of it, but the need to create a large impact on our database drove us to disregard the friendly warnings.

Page 28: 10% Wrong 90% Wrong

An Imperfect World

We knew that we would miss things that could potentially be merged.

We knew that we would create some bad merges.

10% wrong to get it 90% done.

Page 29: 10% Wrong 90% Wrong

Next Step … Normalization

With matching decided, we needed to normalize the data. This was done to copies of the production MARC records, which were used to make the match lists.

Normalization is needed because of variability in how data was entered. It allows us to get the most possible matches out of the data.

Page 30: 10% Wrong 90% Wrong

Normalization Details

We normalized case, punctuation, numbers, non-Roman characters, trailing and leading spaces, some GMDs put in as parts of titles, redacted fields, 10-digit ISBNs to 13-digit, and lots, lots more.

This was not done to the permanent records but to copies used to make the lists.
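
Below is a minimal Python sketch of two of the normalizations described above: lower-casing and de-punctuating a title, and expressing a 10-digit ISBN as 13 digits. The real work was done in SQL on copies of the records and handled many more cases; the helper names here are invented for the example.

    import re

    def normalize_title(title: str) -> str:
        """Lower-case, drop a bracketed GMD, strip punctuation, collapse spaces."""
        t = title.lower()
        t = re.sub(r"\[[^\]]*\]", " ", t)   # e.g. "[sound recording]"
        t = re.sub(r"[^\w\s]", " ", t)      # punctuation
        return re.sub(r"\s+", " ", t).strip()

    def isbn10_to_13(isbn10: str) -> str:
        """Prefix 978 and recompute the check digit for a 10-digit ISBN."""
        digits = re.sub(r"[^0-9Xx]", "", isbn10)
        if len(digits) != 10:
            return digits                   # leave malformed values alone
        core = "978" + digits[:9]           # drop the old ISBN-10 check digit
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)

    print(normalize_title("The hobbit [sound recording] /"))   # "the hobbit"
    print(isbn10_to_13("0-306-40615-2"))                       # "9780306406157"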

Page 31: 10% Wrong 90% Wrong

Weighting

Finally, we had to weight the records that had been matched to determine which should be the record to keep.

To do this, each bib record was given a score to profile its quality.

Page 32: 10% Wrong 90% Wrong

The Weighting Criteria

We looked at the presence, length, and number of entries in the 003, 02X, 24X, 300, 260$b, 100, 010, 500s, 440, 490, 830s, 7XX, 9XX, and 59X fields to manipulate, add to, subtract from, bludgeon, poke, and eventually determine a 24-digit number that would profile the quality of a bib record.
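
A heavily simplified Python sketch of the idea follows: per-field counts are padded into a fixed-width string so records can be compared, loosely mirroring the fixed-length quality profile described above. The actual field list, widths, and arithmetic lived in the SQL implementation; the ones here are illustrative assumptions.

    # Simplified, illustrative scorer: count entries for a few of the tags named
    # above and pad each count to a fixed width, so the concatenation forms a
    # comparable quality profile. Tag choices and widths are assumptions.
    def quality_score(marc_fields: dict) -> str:
        profile = []
        for tag, width in (("003", 1), ("020", 2), ("245", 2), ("300", 2),
                           ("100", 1), ("500", 2), ("700", 2), ("590", 2)):
            count = len(marc_fields.get(tag, []))
            profile.append(str(min(count, 10 ** width - 1)).zfill(width))
        return "".join(profile)

    record = {"245": ["The hobbit /"], "300": ["320 p."], "020": ["9780345339683"],
              "700": ["Tolkien, Christopher."]}
    print(quality_score(record))   # -> "00101010000100"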

Page 33: 10% Wrong 90% Wrong

The Merging

Once the weighting is done, the highest-scored record in each group is made the master record, the copies and holds from the others are moved to it, and those bibs are marked deleted.
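
As a sketch of that selection step (in Python for readability, though the project did this in SQL), the highest-scored bib in each candidate group becomes the master and the rest are queued to have their copies and holds moved before being marked deleted. The function and variable names are invented for the example.

    # Illustrative: pick the highest-scored bib in each group as the master;
    # the others would have their copies and holds re-pointed to it and then
    # be marked deleted (done in SQL in the real project).
    def choose_masters(candidates: dict, scores: dict) -> dict:
        plan = {}
        for key, bib_ids in candidates.items():
            master = max(bib_ids, key=lambda b: scores[b])
            plan[master] = [b for b in bib_ids if b != master]
        return plan

    scores = {1: "00101010000100", 2: "00001010000100"}
    print(choose_masters({("the hobbit", "9780345339683"): [1, 2]}, scores))
    # {1: [2]}  -> bib 2's copies and holds move to bib 1; bib 2 is marked deleted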

Page 34: 10% Wrong 90% Wrong

Checking the Weight

We ran a report of items that would group based on our criteria and had staff do sample manual checks to see if they could live with the dominant record.

We collectively checked ~1,000 merges.

Page 35: 10% Wrong 90% Wrong

90% of the time we felt the highest-quality record was selected as the dominant record. More than 9% of the time an acceptable record was selected.

In a very few instances human errors in the record made the system create a bad profile, but never an actual bad dominant record.

Page 36: 10% Wrong 90% Wrong

The Coding

We contracted with Equinox to develop the code and run it against our test environment (and eventually production).

Galen Charlton was our primary contact. In addition to coding the algorithm, he also provided input on additional criteria to include in the weighting and normalization.

Page 37: 10% Wrong 90% Wrong

Test Server

Once the code was run on the test server, we took the new batches of records and broke them into 50,000-record chunks. We then gave those chunks to member libraries and had them do random sample checks for five days.

Page 38: 10% Wrong 90% Wrong

Fixed As We Went

Non-standard cataloging (ongoing).

13-digit ISBNs normalizing as 10-digit ISBNs.

Identified many parts of item sets as issues.

Shared-title publications with different formats.

The order of the ISBNs.

Kits.

Page 39: 10% Wrong 90% Wrong

In Conclusion

We don’t know how many bad matches were formed.

The total discovered after a year is fewer than 200.

We were able to purge 326,098 bib records, or about 27% of our ISBN-based collection.

Page 40: 10% Wrong 90% Wrong

Evaluation

The catalog is visibly cleaner.

The cost per bib record was 1.5 cents.

Absolutely successful!

Page 41: 10% Wrong 90% Wrong

Future

We want to continue to refine it (e.g., 020 subfield z).

There are still problems that need to be cleaned up in the catalog – some manually and some by automation.

Raising Standards.

Page 42: 10% Wrong 90% Wrong

New libraries that have joined SCLENDs use our deduping algorithm, not the old one.

It has continued to be successful.

Page 43: 10% Wrong 90% Wrong

Open Sourcing the Solution

We are releasing the algorithm under the Creative Commons Attribution Non-Commercial license.

We are releasing the SQL code under the GPL.

Page 44: 10% Wrong 90% Wrong

Questions?