FRBR: Algorithms and Applications
T. Hickey, J. Toves, D. Vizine-Goetz
Online Computer Library Center
CLA November 2004
Outline
Algorithms
• FRBR work matching
• Handling author-title variants
Hardware
• Beowulf cluster
Applications
• Bookmarklets
• FictionFinder
Future directions
Working with Group 1 Entities
WEMI: Work, Expression, Manifestation, Item
Strict expression-level determination is hard
• We primarily divide by language
Manifestation is easier
• We use the WorldCat master record
Work Identification
Algorithm goals:
• Efficient
• Understandable
• Controllable by catalogers
• Uses existing WorldCat records
The Algorithm
A key is generated for each record
Extract author, title
• Look up in LC name authority file
• Added entry information as needed
Form a key from bibliographic record
• Author, title, added entry information
• These can be sorted, compared
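The key idea above can be sketched in a few lines of Python. This is only an illustration, not the OCLC implementation: `make_key` and the sample records are hypothetical, and the lowercase `author/title` layout is an assumption borrowed from the key examples later in this deck.

```python
from collections import Counter

def make_key(author: str, title: str) -> str:
    """Join author and title into one string that can be sorted and compared."""
    return f"{author.strip().lower()}/{title.strip().lower()}"

# Hypothetical (author, title) pairs standing in for bibliographic records.
records = [
    ("Smollett\\1721", "Expedition of Humphry Clinker"),
    ("Smollett\\1721", "Expedition of Humphry Clinker"),
    ("Smollett\\1721", "Humphry Clinker"),
]

# Records that produce the same key fall into the same work set.
key_counts = Counter(make_key(author, title) for author, title in records)
for key, count in key_counts.most_common():
    print(count, key)
```

Because the key is a plain string, grouping records into works reduces to sorting and counting.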
Example
146 Smollett\1721 Expedition of Humphry Clinker
16 Smollett\1721 Expedition of Humphrey Clinker
8 Smollett\1721 Humphry Clinker
4 Smollett\1721 Humphrey Clinker
2 Smollett\1721 Expedition of Humphry Clinker
1 Smollett\1721 Calatoriile lui Humphrey Clinker
1 Smollet\1721 Expedition of Humphry Clinker
1 Smollett Humphry Klinkers Reisen
Example (with authorities)
156 Smollett\1721 Expedition of Humphry Clinker
16 Smollett\1721 Expedition of Humphrey Clinker
4 Smollett\1721 Humphrey Clinker
1 Smollett\1721 Calatoriile lui Humphrey Clinker
1 Smollet\1721 Expedition of Humphry Clinker
1 Smollett\1721 Humphry Klinkers Reisen
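The effect of the authority lookup in the two examples can be sketched as a cross-reference table that maps a variant key to its authorized form before counting. The table contents and the `authorized` helper below are hypothetical, and only a subset of the counts from the example is used.

```python
from collections import Counter

# Hypothetical cross-reference table: variant key -> authorized key.
authority = {
    "smollett\\1721/humphry clinker": "smollett\\1721/expedition of humphry clinker",
}

def authorized(key: str) -> str:
    """Return the authorized form of a key, or the key itself if no entry exists."""
    return authority.get(key, key)

# Counts before the authority lookup (subset of the first example).
raw_counts = Counter({
    "smollett\\1721/expedition of humphry clinker": 146,
    "smollett\\1721/humphry clinker": 8,
    "smollett\\1721/expedition of humphrey clinker": 16,
})

# Fold each variant into its authorized key and add up the record counts.
merged = Counter()
for key, count in raw_counts.items():
    merged[authorized(key)] += count
```

Keys without an authority entry, such as the "Humphrey" spelling, stay separate, which matches the second example.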
More Detail
Extract author names
• Look up in authority file
• Currently only personal names
• Subfields $abcdq
Extract title
• Always use uniform titles if present
• Look up author/short title (~$a)
• Look up author/long title (~$abfgnp)
• Prefer alternative title for non-English
Create key from author/title
• Always do NACO normalization (has limitations)
• Add information for uncontrolled title-main-entry
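To give a feel for what normalization does to a heading, here is a deliberately simplified sketch. It is not the NACO rule set, which is more detailed; `simple_normalize` is a hypothetical helper that only folds case, strips diacritics, and turns punctuation into single spaces.

```python
import re
import unicodedata

def simple_normalize(text: str) -> str:
    """Simplified, NACO-like normalization (an approximation, not the real rules)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case and replace every run of other characters with one space.
    return re.sub(r"[^0-9a-z]+", " ", no_marks.lower()).strip()
```

For example, `simple_normalize("Dickens, Charles, 1812-1870")` gives `"dickens charles 1812 1870"`, so headings that differ only in punctuation or case produce the same key.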
Authority Files Rule!
Authors
Author/titles
Bring together variations
Allow override in difficult cases
• Both splitting and joining groups
• Especially important with xISBN matching
Especially important with non-English metadata
Limitations of the Authority File
What’s missing:
• Many uniform titles
• Many author variants
• Many title variants
• Language of heading
Partial solution
• Create auxiliary files of mechanically generated matches
Results of FRBR Matching on WorldCat
88% of works are ‘singletons’ (a single manifestation)
30% of manifestations are in 12% of the works
Average size of multiple matches: 3.1 manifestations/work
43.1 million works from 54 million manifestations
54% of holdings are on a FRBR work with >1 manifestation
WorldCat manifestations average about 20 holdings
FRBR helps where help is most needed
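The figures above fit together arithmetically if the 88% figure is read as the share of works that have a single manifestation (an assumption made for this check). A quick calculation with the rounded numbers from the slide:

```python
works = 43.1e6            # works identified in WorldCat
manifestations = 54e6     # manifestations (master records)
singleton_share = 0.88    # assumed: share of works with exactly one manifestation

singleton_works = singleton_share * works
multi_works = works - singleton_works
# Each singleton work accounts for exactly one manifestation.
multi_manifestations = manifestations - singleton_works

share_in_multi = multi_manifestations / manifestations
average_size = multi_manifestations / multi_works

print(f"{share_in_multi:.0%} of manifestations are in multi-manifestation works")
print(f"{average_size:.1f} manifestations per multi-manifestation work")
```

With these inputs the result lands close to the reported 30% and 3.1 manifestations/work.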
More FRBR Results
310,000 works have more than 5 manifestations
1.7 million have more than 2 manifestations
Largest:
• 30,000+ for the Bible
• 1,537 Shakespeare’s Macbeth
• 1,026 Dickens’s Christmas Carol
The Top 10 Works by Holdings
Rank Work Holdings Manifestations
1 US Census (various) 403,252 10,164
2 Bible (combined) 271,534 36,738
3 Mother Goose 66,543 1,997
4 Dante, The Divine Comedy 59,034 2,714
5 Homer, The Odyssey 43,871 2,009
6 Homer, The Iliad 42,756 2,388
7 Twain, Huckleberry Finn 39,310 1,093
8 Shakespeare, Hamlet 37,683 1,917
9 Carroll, Alice’s Adventures in Wonderland 37,614 1,865
10 Tolkien, Lord of the Rings 37,461 643
The Top 10 Works Cataloged in 2003
Work Libraries
1 Rowling, Harry Potter and the Order of the Phoenix 2,406
2 Clinton, Living History 36,738
3 Rohmann, My Friend Rabbit 1,997
4 Brown, The Da Vinci Code 2,714
5 Gibaldi, MLA Handbook 2,009
Top 1000 Publication Dates
Top 1000 Languages
Our Beowulf Cluster
24 Nodes
• Each with 2x2.6 GHz processors
• 4 GBytes memory (96 GBytes total)
One ‘head’ node, 23 ‘compute’ nodes
46x40 GBytes disk (~2 Terabytes total)
Gigabit switch
What we are using it for
All our bibliographic processing
• FRBR
• Extractions
• Searching
• Matching
Ganglia load visualization
Starting point
FRBR key generation
25 hours on a 3.00GHz workstation with 2GB of RAM
Generate two key files
• sort by key, uniq by key, sort by occurrence
• sort by key, post processing on keys, uniq by key, sort by occurrence
Merge key files
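The "sort by key, uniq by key, sort by occurrence" step is the familiar sort/uniq/sort pipeline. A minimal in-memory Python sketch follows; `collapse` is a hypothetical name, and the real job works on key files far too large for this approach.

```python
from itertools import groupby

def collapse(keys):
    """Sort by key, collapse duplicates into counts, then sort by occurrence."""
    # groupby only merges adjacent equal items, so the keys are sorted first.
    counted = [(key, sum(1 for _ in group)) for key, group in groupby(sorted(keys))]
    # Most frequent keys first; ties are broken by key for a stable order.
    return sorted(counted, key=lambda pair: (-pair[1], pair[0]))
```

For example, `collapse(["b", "a", "b"])` gives `[("b", 2), ("a", 1)]`.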
FRBR on the Cluster
44 minutes on the cluster
69 key builders & 23 sort buckets with hyperthreading ON
Generate 23 radix-sorted, post-processed key files
Collapse and sort by occurrence in parallel
Also outputs additional files used by other jobs
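One way to picture the bucketed version: route each key to one of 23 buckets by its leading byte, so identical keys always meet in the same bucket and each bucket can be collapsed independently. The sketch below is an assumed reading of "radix-sorted", not the cluster code; all function names are hypothetical, a thread pool stands in for the compute nodes, and the final sort by occurrence is done serially here rather than in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

N_BUCKETS = 23  # one per sort bucket, as on the slide

def bucket_of(key: str) -> int:
    """Range-partition on the first byte so bucket order follows key order."""
    first = key.encode("utf-8")[:1] or b"\x00"
    return first[0] * N_BUCKETS // 256

def partition(keys):
    """Distribute keys over the buckets; equal keys share a bucket."""
    buckets = [[] for _ in range(N_BUCKETS)]
    for key in keys:
        buckets[bucket_of(key)].append(key)
    return buckets

def collapse_bucket(bucket):
    """Count duplicate keys inside one bucket and return them sorted by key."""
    return sorted(Counter(bucket).items())

def run(keys):
    # Each bucket is collapsed by a separate worker (stand-in for a node).
    with ThreadPoolExecutor() as pool:
        per_bucket = list(pool.map(collapse_bucket, partition(keys)))
    merged = [pair for bucket in per_bucket for pair in bucket]
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))
```

Since no key can appear in two buckets, the per-bucket counts never need to be reconciled with each other.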
Application: Preservation
Identify ‘final copy’ items
Do it at the work level
Single-singles
• Single manifestations with single holding
• Found 18 million in WorldCat
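Finding single-singles is a simple filter once records are grouped into works. A minimal sketch, assuming a hypothetical mapping from work key to the holdings count of each of its manifestations:

```python
# Hypothetical workset data: work key -> holdings count per manifestation.
worksets = {
    "work-a": [1],       # one manifestation held by one library
    "work-b": [1, 12],   # two manifestations
    "work-c": [3],       # one manifestation held by three libraries
}

def single_singles(worksets):
    """Return works that have one manifestation with exactly one holding."""
    return [work for work, holdings in worksets.items()
            if len(holdings) == 1 and holdings[0] == 1]

final_copy_candidates = single_singles(worksets)
```

Checking at the work level matters: a manifestation with a single holding is not a last copy if the same work exists in another manifestation.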
Application: xISBN
A simple Web service
Given an ISBN:
• Identify the workset it is in
• Return all other ISBNs in that workset
Results should be symmetrical!
• Same group retrieved for each ISBN in group
ISBNs sorted by number of library holdings
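The lookup behind such a service can be sketched with an index from each ISBN to its workset. Everything below is illustrative: the holdings counts are invented, `xisbn` is a hypothetical function name, and the whole group (including the queried ISBN) is returned, as in the example response in this deck.

```python
# Hypothetical worksets: ISBN -> number of library holdings (invented counts).
worksets = [
    {"0192816640": 120, "0140430210": 300, "0393952835": 45},
    {"0000000001": 10},
]

# Index every ISBN to the workset that contains it.
index = {isbn: workset for workset in worksets for isbn in workset}

def xisbn(isbn: str):
    """Return all ISBNs in the same workset, most widely held first."""
    workset = index.get(isbn.replace("-", ""))
    if workset is None:
        return []
    return sorted(workset, key=lambda i: (-workset[i], i))
```

Symmetry follows from the design: every ISBN in a group points at the same workset, so any of them retrieves the same list.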
xISBN Example
http://labs.oclc.org/xisbn/0-19-281664-0 returns:
<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
<isbn>0192816640</isbn>
<isbn>0820312037</isbn>
<isbn>0820315370</isbn>
<isbn>0393015920</isbn>
<isbn>0393952274</isbn>
<isbn>0393952835</isbn>
<isbn>0140430210</isbn>
<isbn>0192811320</isbn>
<isbn>0192835947</isbn>
<isbn>0460872885</isbn>
<isbn>1853262706</isbn>
<isbn>0874131219</isbn>
</idlist>
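A client only needs a standard XML parser to read this response. The sketch below parses a shortened copy of the response above with Python's `xml.etree.ElementTree`; it does not call the service, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response shown above.
response = """<?xml version="1.0" encoding="UTF-8" ?><idlist>
<isbn>0192816640</isbn><isbn>0820312037</isbn><isbn>0820315370</isbn>
</idlist>"""

# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(response.encode("utf-8"))
isbns = [element.text for element in root.iter("isbn")]
print(isbns)
```

The list keeps the order of the response, so the first entry is the most widely held ISBN in the workset.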
Matching on ISBNs
ISBNs add information beyond author/title
• Allows relaxation of matching
• Introduces possible errors
Offers the possibility of substantial improvement of work matching
Merging Worksets Using ISBN Matches
Pair ISBNs with FRBR keys (starts with 10 million ISBNs)
Throw out ISBNs in single worksets
Throw out ISBNs in > 5 worksets
(We now have 561,000 ISBNs left)
Are the titles similar enough?
Throw out large groups
Try to be very conservative
Authority file always overrides other matching
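The filtering steps above can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, `difflib.SequenceMatcher` is swapped in for whatever title comparison was actually used, and the 0.8 threshold is invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def candidate_isbns(pairs, max_worksets=5):
    """Keep ISBNs that occur in more than one workset but at most max_worksets."""
    keys_by_isbn = defaultdict(set)
    for isbn, frbr_key in pairs:
        keys_by_isbn[isbn].add(frbr_key)
    # An ISBN in a single workset suggests no merge; one in many is suspect.
    return {isbn: keys for isbn, keys in keys_by_isbn.items()
            if 1 < len(keys) <= max_worksets}

def similar(title_a: str, title_b: str, threshold: float = 0.8) -> bool:
    """Assumed similarity test: character-level ratio above a made-up threshold."""
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold

# Hypothetical (ISBN, FRBR key) pairs.
pairs = [
    ("0192816640", "smollett\\1721/expedition of humphry clinker"),
    ("0192816640", "smollett\\1721/humphry clinker"),
    ("0140430210", "smollett\\1721/expedition of humphry clinker"),
]
candidates = candidate_isbns(pairs)
```

Only ISBNs that survive both filters would propose a merge, and an authority file entry would still take precedence over any such proposal.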
Matches from ISBN Matching
74,000 author variants
~200,000 title variants
These all create additional cross reference records
Automatically folded into FRBR matching
Kept separate from NACO file
• Only used in research at this time
Examples of Possible Matches
/mcgraw hill encyclopedia of science & technology
/mcgraw hill encyclopedia of science & technology\1\aar aor
/mcgraw hill encyclopedia of science & technology\2\apa boo
/mcgraw hill encyclopedia of science & technology\3\bor cle
/mcgraw hill encyclopedia of science & technology\4\cli cyt
…
dickens, charles\1812 1870/tale of two cities
dickens, charles\1812 1870/hard times
dickens, charles\1812 1870/sketches by boz
dickens, charles\1812 1870/martin chuzzlewit
dickens, charles\1812 1870/bleak house
dickens, charles\1812 1870/little dorrit
dickens, charles\1812 1870/oliver twist
…
Application: Bookmarklets
Clicking on Princeton
FictionFinder
Indexes fiction from WorldCat
Uses FRBR workset algorithm
Focused on fiction
Searching and browsing by
• Genre
• Fictitious Characters
• Imaginary Places
• Literary Forms
Links to
• Google
• Open WorldCat
Diane Vizine-Goetz’s project
‘Humphry Clinker’ Search
Work Display
Detail of Language Display
First Few English Manifestations
Manifestation Display
Open WorldCat Link
Additional Matches
Match variant titles:
• When the wind blows
• When the wind blows: a novel
FictionFinder identified 10,000 similar variations
• novela, novella, roman, …
Created auxiliary authority records
Now automatically used when FRBR algorithm is run
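The variant-title match can be illustrated by dropping a generic subtitle before the key is built. This is a sketch only: `strip_generic_subtitle` is a hypothetical helper, the subtitle list contains just the terms named on the slide, and the actual work was done through auxiliary authority records rather than a rule like this.

```python
# Generic subtitles named on the slide; the real list is longer.
GENERIC_SUBTITLES = ("a novel", "novela", "novella", "roman")

def strip_generic_subtitle(title: str) -> str:
    """Drop a trailing ': a novel'-style subtitle; leave other titles alone."""
    main, separator, subtitle = title.partition(":")
    if separator and subtitle.strip().lower() in GENERIC_SUBTITLES:
        return main.strip()
    return title.strip()
```

With this rule, "When the wind blows" and "When the wind blows: a novel" reduce to the same title, while a meaningful subtitle is kept.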
Future
Continued development of FictionFinder
Extending algorithm to serials?
FirstSearch displays
Additional matching criteria
Local authority files?
Integration of auxiliary files for production?
Exploring FRBRizing some European catalogs
Looking at extending beyond Roman characters
Links
IFLA FRBR - Final Report
• http://www.ifla.org/VII/s13/frbr/frbr.htm
Article in D-Lib
• http://www.dlib.org/dlib/september02/hickey/09hickey.html
OCLC Research Activities with FRBR
• http://www.oclc.org/research/projects/frbr/
FictionFinder
• http://fictionfinder.oclc.org/
Top 1000
• http://www.oclc.org/research/top1000/