VIAF Reflections - OCLC

Post on 14-Mar-2018


Thomas Hickey

Chief Scientist, OCLC Research

2016 August

Authority Data on the Web

VIAF Reflections


Scope of VIAF

(2009)

Personal names

Geographic

Corporate

Title

Family

Events

Everything but concepts is considered in scope

National level, but willing to consider other sources

Started with 2 files

• Ed O’Neill’s group gave us DDB & LC matching

• First project: replicate their matching of DDB/LC

• Second: extended to 3 files with BnF

– UNIMARC

– UNICODE


Modest beginnings

• First interface retrieved clusters

– A cluster was just a set of 1-3 source IDs

– Showed preferred form for each

• No linked data

• Hardly any merged information

• But enough to be useful


Matching

• Pairwise matching

– Still how we do it

– Pairwise links used for evaluation as well

• Map-reduce happened

– Method of distributing processing

– Obtained a cluster

• Rocks followed by Pebbles and now Gravel

• 400+ CPUs, terabyte of memory, petabytes of disk

– M-R used for much of OCLC’s processing
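One way to picture how pairwise links turn into clusters is as connected components over source IDs. The sketch below is illustrative only: the union-find approach, the name `cluster_links`, and the sample IDs are assumptions, not a description of VIAF's actual code.

```python
from collections import defaultdict


def cluster_links(links):
    """Group source IDs into clusters from pairwise match links.

    Hypothetical helper: treats each link as an undirected edge and
    returns the connected components (a union-find sketch, not
    necessarily how VIAF builds its clusters).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps later lookups short.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(c) for c in clusters.values())


# Made-up source IDs for illustration.
links = [("LC:1", "DNB:7"), ("DNB:7", "BNF:3"), ("LC:2", "BNF:9")]
print(cluster_links(links))
```

A single wrong link merges two clusters under this model, which is one reason stricter matching and per-link evaluation matter.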


VIAF

Sources: DNB Bib & Authority; BnF Bib & Authority; LC Bib & Authority

~7.5 million personal name authority records

~25 million bibliographic records

~1.2 million links between files

(2008)

Current size of VIAF

• 44 active participants/files

• 55 million source authority records

• 130 million bibliographic records

• 256 million links between sources

• 30 million external links

• 33 million VIAF clusters

Sources

• Still concentrating on national libraries and national/international consortia

• But we have

– Getty ULAN

– Wikipedia (Wikidata)

– Perseus

– Syriac

– xR


Communautés européennes. Cour de justice. Division Bibliothèque

http://viaf.org/viaf/127884087

Various types of dates

藤原, 長清, 永仁頃 (Reign)

المجلسي، محمد باقر بن محمد تقي، 1037-1111 هـ. (Hijri)

Gregorian, pre-Gregorian

http://journal.code4lib.org/articles/9607


Various dates

Joan, Clímac, sant, s. VI

Joannes, Climacus, 6e/7e E.

Jean Climaque, saint, 0579?-0649?

Jan III (papież ; -574)

John, Climacus, Saint, 6th cent.

Jean III pape 05..-0574

Johannes Klimakos, helgon, 500-talet

Jan Klimakos, svatý, asi 579-asi 649


More date variations

Suetonius, approximately 69-approximately 122

Suetonius Tranquillus, Caius, ca. 69/70-ca. 140

Suetónio, fl. 69-141

Suetonius Tranquillus, Gaius, asi 69-140

Suetonius Tranquillus, Gaius, 69-

Svetonijs, apm. 69-apm. 122

Suetonius Tranquillus, Caius, f. sec. I-II

Suetonius Tranquillus, Caius (ca 70-ca 140)

Suetoni, ca. 69-140
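The headings above record the same dates under many conventions, so comparing them as text is unreliable. A rough sketch of pulling out the numbers instead is shown here. The helper names and the one-year tolerance are assumptions for illustration, not VIAF's rules.

```python
import re


def extract_years(heading):
    """Return the numbers that appear in a name heading, in order.

    Hypothetical helper: it treats every number as a year, so century
    forms such as "6th cent." or "f. sec. I-II" need separate handling.
    """
    return [int(n) for n in re.findall(r"\d+", heading)]


def birth_years_agree(a, b, tolerance=1):
    """True if both headings carry a first year within `tolerance`."""
    ya, yb = extract_years(a), extract_years(b)
    return bool(ya and yb) and abs(ya[0] - yb[0]) <= tolerance


print(extract_years("Suetonius Tranquillus, Caius, ca. 69/70-ca. 140"))
print(birth_years_agree("Suetoni, ca. 69-140",
                        "Suetonius Tranquillus, Caius (ca 70-ca 140)"))
```

Reign and Hijri dates, as on the earlier slide, would need calendar conversion before any such comparison.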


Variant forms for FRBR processing

mcshann, jay

mcshann, jay.leader

mcshann, jay.1909 2006

mcshann, jay.1916 2006

wang, xinlian.1960

王新莲

王新莲.1960

王新莲.singer

de lucia, pepe.leader

lucia, pepe de

pepe de lucia

pepe de lucia.1945

christopher, r

christopher, russel

christopher, Russell

christopher, russell.1930
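The forms above look like lowercased names with an optional qualifier (dates or a role) after a period. A minimal sketch of generating such keys follows. The "name.qualifier" layout is inferred from the examples on this slide, and `variant_keys` is a hypothetical helper, not a documented OCLC format or function.

```python
def variant_keys(name, dates=None, roles=()):
    """Build lowercase match keys in the style shown on the slide.

    Hypothetical helper: the "name.qualifier" layout is inferred from
    the slide's examples, not from a documented OCLC format.
    """
    # Lowercase and collapse runs of whitespace.
    base = " ".join(name.lower().split())
    keys = [base]
    if dates:
        # "1909-2006" becomes "1909 2006"; an open date "1960-" becomes "1960".
        years = " ".join(part for part in dates.split("-") if part)
        keys.append(f"{base}.{years}")
    keys.extend(f"{base}.{role.lower()}" for role in roles)
    return keys


print(variant_keys("McShann, Jay", dates="1909-2006", roles=["leader"]))
```

Conflicting source dates (1909 versus 1916 for McShann) simply yield two different keys, as in the list above.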


Finding Works and Expressions

Diagram labels: Production; enhanced WorldCat; PREVIOUS; FRBR CLUSTERS; VIAF; Generated Authorities (xR); Works & Expressions; Meta Authorities; Full Encoding; Series; Overrides; FAST; GLIMIR; Aud Level; LCSH, Genres, MeSH; Work Records

What we did right

• National library participation is critical

• Minimal changes to source data

– Use original IDs

– Original MARC tagging when available

– Flexible on source format, harvest

• Multiple interface languages

• Minimize use of name text for matching

• Linked open data, bulk availability

• Used within OCLC


Other options?

• Stick to the idea of ‘virtual’

– No VIAF identifier

• Use MARC internally for the clusters

– Currently an ad-hoc XML

– Extensions to MARC-21 make it easier

• More JSON, less XML?

• Avoid immature files

• More mathematical approach to clustering

• Avoid ‘|’ in our internal IDs

• Stricter matching


Overall

• VIAF has been a remarkable success

– Support of participating libraries

– Support of OCLC

– Strong demand

– Emphasis on linked data

• It’s been a privilege to work on it!
