VIAF Reflections - OCLC forms for FRBR processing mcshann, jay mcshann, ... FRBR CLUSTERS VIAF ......

Thomas Hickey, Chief Scientist, OCLC Research

August 2016

Authority Data on the Web


Scope of VIAF (2009)

• Personal names

• Geographic

• Corporate

• Title

• Family

• Events

Everything but concepts is considered in scope

National level, but willing to consider other sources


Started with 2 files

• Ed O’Neill’s group gave us DDB & LC matching

• First project: replicate their matching of DDB/LC

• Second: extended to 3 files with BnF

– UNIMARC

– UNICODE


Modest beginnings

• First interface retrieved clusters

– A cluster was just a set of 1-3 source IDs

– Showed preferred form for each

• No linked data

• Hardly any merged information

• But enough to be useful


Matching

• Pairwise matching

– Still how we do it

– Pairwise links used for evaluation as well

• Map-reduce happened

– Method of distributing processing

– Obtained a cluster

• Rocks followed by Pebbles and now Gravel

• 400+ CPUs, terabyte of memory, petabytes of disk

– M-R used for much of OCLC’s processing
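
The slide names pairwise matching and map-reduce but does not give the algorithm. What follows is only a minimal sketch of how the two ideas can fit together, not VIAF's actual matching. The blocking key (a lowercased name), the match rule (a shared title), the record layout, and all function names are assumptions made for illustration, and the sample records are invented. The map step emits (key, record) pairs, the reduce step compares records pairwise within each key group, and a small union-find turns the pairwise links into clusters.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records):
    """Emit (blocking_key, record) pairs; the key is a lowercased name (an assumption)."""
    for rec in records:
        yield rec["name"].lower(), rec


def reduce_phase(grouped):
    """Compare records pairwise within each key group and emit links between IDs."""
    for group in grouped.values():
        for a, b in combinations(group, 2):
            # Hypothetical match rule: different sources and at least one shared title.
            if a["source"] != b["source"] and a["titles"] & b["titles"]:
                yield a["id"], b["id"]


def cluster(ids, links):
    """Merge pairwise links into clusters with a simple union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


def run(records):
    """Run map, group-by-key (the shuffle), reduce, then clustering."""
    grouped = defaultdict(list)
    for key, rec in map_phase(records):
        grouped[key].append(rec)
    links = list(reduce_phase(grouped))
    return cluster([r["id"] for r in records], links)


# Invented sample data; IDs, sources, and titles are placeholders.
RECORDS = [
    {"id": "LC-1", "source": "LC", "name": "McShann, Jay", "titles": {"title-a"}},
    {"id": "DNB-7", "source": "DNB", "name": "mcshann, jay", "titles": {"title-a", "title-b"}},
    {"id": "BNF-3", "source": "BNF", "name": "McShann, Jay", "titles": {"title-c"}},
]

if __name__ == "__main__":
    print(run(RECORDS))
```

In this sketch the pairwise links are kept as an explicit list before clustering, so the same links could also be inspected for evaluation, as the slide mentions.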


VIAF (2008)

Sources: DNB Bib & Authority, BnF Bib & Authority, LC Bib & Authority

• ~7.5 million personal name authority records

• ~25 million bibliographic records

• ~1.2 million links between files


Current size of VIAF

• 44 active participants/files

• 55 million source authority records

• 130 million bibliographic records

• 256 million links between sources

• 30 million external links

• 33 million VIAF clusters


Sources

• Still concentrating on national libraries and national/international consortia

• But we have

– Getty ULAN

– Wikipedia (Wikidata)

– Perseus

– Syriac

– xR


Communautés européennes. Cour de justice. Division Bibliothèque

http://viaf.org/viaf/127884087


Various types of dates

藤原, 長清, 永仁頃 (Reign)

.هـ1111-1037لمجلسي، محمد باقر بن محمد تقي، (Hijri)

Gregorian, pre-Gregorian

http://journal.code4lib.org/articles/9607


Various dates

Joan, Clímac, sant, s. VI

Joannes, Climacus, 6e/7e E.

Jean Climaque, saint, 0579?-0649?

Jan III (papież ; -574)

John, Climacus, Saint, 6th cent.

Jean III pape 05..-0574

Johannes Klimakos, helgon, 500-talet

Jan Klimakos, svatý, asi 579-asi 649


More date variations

Suetonius, approximately 69-approximately 122

Suetonius Tranquillus, Caius, ca. 69/70-ca. 140

Suetónio, fl. 69-141

Suetonius Tranquillus, Gaius, asi 69-140

Suetonius Tranquillus, Gaius, 69-

Svetonijs, apm. 69-apm. 122

Suetonius Tranquillus, Caius, f. sec. I-II

Suetonius Tranquillus, Caius (ca 70-ca 140)

Suetoni, ca. 69-140
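
The variations above suggest why dates need normalizing before they can be compared across sources. Below is a rough sketch, assuming only the qualifiers that appear in these examples (approximately, ca., asi, apm., fl.); it is not VIAF's date parser, and `year_range` is a hypothetical name. It strips the qualifiers and pulls out a start and end year. Century-only forms ("6th cent.", "s. VI"), partial years ("05..") and non-Gregorian calendars are deliberately left unhandled and return None.

```python
import re

# Qualifiers seen in the slide examples; matched as whole words so that
# "ca" does not eat the start of a name such as "Caius".
QUALIFIER = re.compile(r"\b(?:approximately|asi|apm|ca|fl)\b\.?\s*", re.IGNORECASE)

# A year range: start year, optional "?", optional "/alternative", a hyphen,
# then an optional end year with optional "?".
RANGE = re.compile(r"(\d{1,4})\??(?:/\d{1,4})?\s*-\s*(?:(\d{1,4})\??)?")


def year_range(heading):
    """Return (start, end) years as ints, end possibly None, or None if no range is found."""
    text = QUALIFIER.sub("", heading)
    match = RANGE.search(text)
    if not match:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    return start, end


if __name__ == "__main__":
    for heading in [
        "Suetonius, approximately 69-approximately 122",
        "Suetonius Tranquillus, Caius, ca. 69/70-ca. 140",
        "Suetonius Tranquillus, Gaius, 69-",
        "Suetonius Tranquillus, Caius, f. sec. I-II",
    ]:
        print(heading, "->", year_range(heading))
```

Even after this kind of normalization the sources still disagree (122, 140 and 141 all appear as end years for Suetonius), so a comparison would need some tolerance rather than exact equality.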


Variant forms for FRBR processing

mcshann, jay

mcshann, jay.leader

mcshann, jay.1909 2006

mcshann, jay.1916 2006

wang, xinlian.1960

王新莲

王新莲.1960

王新莲.singer

de lucia, pepe.leader

lucia, pepe de

pepe de lucia

pepe de lucia.1945

christopher, r

christopher, russel

christopher, russell

christopher, russell.1930
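
The forms listed above appear to follow a simple pattern: a lowercased name, optionally followed by a period and a qualifier such as dates or a role. The sketch below generates keys in that style. The pattern is inferred from the slide examples only; the actual OCLC normalization rules are not given here, and `variant_forms` and `shared_forms` are hypothetical names.

```python
def variant_forms(names, dates=(), roles=()):
    """Build 'name' and 'name.qualifier' keys in the style shown on the slide."""
    forms = set()
    for name in names:
        base = " ".join(name.lower().split())  # lowercase and collapse whitespace
        forms.add(base)
        for qualifier in list(dates) + list(roles):
            forms.add(f"{base}.{qualifier}")
    return sorted(forms)


def shared_forms(a, b):
    """Keys that two records have in common; a shared key is a candidate link."""
    return sorted(set(a) & set(b))


if __name__ == "__main__":
    one = variant_forms(["McShann, Jay"], dates=["1909 2006"], roles=["leader"])
    two = variant_forms(["McShann, Jay"], dates=["1916 2006"])
    print(one)
    print(shared_forms(one, two))
```

In this sketch two records with conflicting dates still share the bare name key, which hints at why qualified forms are more useful than name text alone.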


Finding Works and Expressions

Diagram labels: Production enhanced WorldCat; PREVIOUS FRBR CLUSTERS; VIAF; Generated Authorities (xR); Works & Expressions; Meta Authorities; Full Encoding; Series; Overrides; FAST; GLIMIR; Aud Level; LCSH, Genres, MeSH; Work Records


What we did right

• National library participation is critical

• Minimal changes to source data

– Use original IDs

– Original MARC tagging when available

– Flexible on source format, harvest

• Multiple interface languages

• Minimize use of name text for matching

• Linked open data, bulk availability

• Used within OCLC


Other options?

• Stick to the idea of ‘virtual’

– No VIAF identifier

• Use MARC internally for the clusters

– Currently an ad-hoc XML

– Extensions to MARC-21 make it easier

• More JSON, less XML?

• Avoid immature files

• More mathematical approach to clustering

• Avoid ‘|’ in our internal IDs

• Stricter matching


Overall

• VIAF has been a remarkable success

– Support of participating libraries

– Support of OCLC

– Strong demand

– Emphasis on linked data

• It’s been a privilege to work on it!
