1
Introducing some Standards
Paul Miller
Interoperability Focus
UK Office for Library & Information Networking (UKOLN)
[email protected] http://www.ukoln.ac.uk/
UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the Further and Higher Education Funding Councils, as well as by project funding from JISC and the EU. UKOLN also receives support from the Universities of Bath and Hull where staff are based.
2
So… why use standards?
• Benefit from the expertise of others
• Enforce rigour in internal practices
• Facilitate interoperability (and access)
– Considered deployment of standard solutions makes access to your resources feasible for many.
3
What do standards do?
• Help identify what’s important
– CIMI’s “Access Points”
– Mandatory fields
• Allow for consistent use of terminology
– Name Authority Files
– Thesauri
– Look-up tables
• Enable internal and external data exchange or access
• Reduce duplication of effort
• Minimise (hopefully!) wasted effort
• Reflect consensus.
4
What types of standard are there?
• Terminology
– ‘Roma’, not ‘Rome’
– ‘Roma’ is preferred to ‘Rome’
• Format
– ‘Miller, A.P. 1971–’, not ‘Paul Miller’
• ‘Semantics’
– A gross simplification, and a very big bucket
– ‘Creator’, ‘Subject’, ‘Title’, ‘Description’…
• Syntax
– <RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#">
• Transfer
– ftp://ftp.niso.org/ … .
5
Terminological Standards
(Based upon an earlier presentation with Matthew Stiff of mda)
See www.ariadne.ac.uk/issue23/metadata/
6
The need for control…
European Community
E.E.C.
Common Market
European Union!
7
Without control of terms...
Users are
– incorrectly utilising search terms
– failing to find significant resources
– suffering from information overload
– might almost as well be using Google
Creators are
– cataloguing inconsistently
– unable to convey hierarchical concepts
– Scotland is in United Kingdom is in Europe is in ...
– perpetuating localised terminology
– unable to assess, let alone undertake, integration projects.
8
With control...
Users might
– gain more effective access to a resource
– gain far more effective access across resources
– reduce the number of ‘false hits’
– find what they are looking for
– even learn to think and express themselves in a structured manner.
Creators might
– produce more valuable resources
– convey complex semantic and structural concepts
– move towards disciplinary, national, international or global terminologies
– effectively integrate both new and existing resources.
9
Controlled Vocabulary
European Union
E.E.C.
Common Market
European Community
... Etc.
With a controlled vocabulary, one or more of these terms might be permitted. Use of the others for record creation or retrieval would be rejected by the system.
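The accept-or-reject behaviour of a controlled vocabulary can be sketched in a few lines of Python. This is only an illustration: it assumes that ‘European Union’ is the single permitted term, and the names PERMITTED_TERMS and check_term are hypothetical.

```python
# Hypothetical controlled vocabulary: only these terms are permitted.
PERMITTED_TERMS = {"European Union"}


def check_term(term: str) -> str:
    """Return the term if it is permitted; otherwise reject it."""
    if term not in PERMITTED_TERMS:
        raise ValueError(f"'{term}' is not in the controlled vocabulary")
    return term


# Try each candidate term, as a cataloguing or search system might.
for candidate in ["European Union", "E.E.C.", "Common Market"]:
    try:
        print("accepted:", check_term(candidate))
    except ValueError as err:
        print("rejected:", err)
```

The same check would apply at both record creation and retrieval, which is what keeps the two sides consistent.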
10
Thesaurus-based Control
European Union [preferred term]
E.E.C. [synonym]
Common Market [synonym]
European Community [synonym]
... Etc. [synonyms]
In a thesaurus, all of the terms might be considered equally valid, with one identified as the preferred term and the others as synonyms.
But... are they really synonymous?
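In contrast to outright rejection, a thesaurus accepts every listed term and resolves it to the preferred one. The following Python sketch shows that idea under simplifying assumptions: the USE table, the function names and the three catalogue records are all made up for illustration, and the sketch treats the terms as exact synonyms, which is precisely the assumption questioned above.

```python
# Hypothetical equivalence table: non-preferred term -> preferred term.
USE = {
    "E.E.C.": "European Union",
    "Common Market": "European Union",
    "European Community": "European Union",
}


def preferred(term):
    """Map any accepted term to its preferred form."""
    return USE.get(term, term)


def search(records, query):
    """Return record names whose subject resolves to the same preferred term as the query."""
    wanted = preferred(query)
    return [name for name, subject in records.items() if preferred(subject) == wanted]


# Made-up records, each catalogued with a different term.
records = {
    "Record A": "E.E.C.",
    "Record B": "European Union",
    "Record C": "Scotland",
}
print(search(records, "Common Market"))
```

A search on any of the synonyms retrieves records catalogued under any of the others.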
11
Thesauri
• A traditional thesaurus defines synonyms and, perhaps, antonyms for terms within a given language.
• E.g. ‘workshop’
– atelier, factory, mill, plant, shop, studio, workroom
...or...?
– class, discussion group, seminar, study group.
12
Thesauri in Information Retrieval
• In the context of information retrieval, thesauri do more, facilitating the creation of hierarchies of meaning...
13
Hierarchies of Meaning
‘Glass’
  ‘Beer Glass’
  ‘Wine Glass’
    ‘Red wine glass’
    ‘White wine glass’
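One practical use of such a hierarchy is query expansion: a search on a broad term can also retrieve records indexed under its narrower terms. The Python sketch below assumes that ‘Beer Glass’ and ‘Wine Glass’ sit under ‘Glass’, and that the red and white wine glasses sit under ‘Wine Glass’; the names NARROWER and expand are hypothetical.

```python
# Narrower-term links for the glass example (assumed nesting, see lead-in).
NARROWER = {
    "Glass": ["Beer Glass", "Wine Glass"],
    "Wine Glass": ["Red wine glass", "White wine glass"],
}


def expand(term):
    """Return the term followed by all of its narrower terms, depth first."""
    result = [term]
    for child in NARROWER.get(term, []):
        result.extend(expand(child))
    return result


print(expand("Wine Glass"))
print(expand("Glass"))
```

Expanding ‘Wine Glass’ yields the term itself plus its two narrower terms; expanding ‘Glass’ yields all five.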
14
Thesaurus Components
• Most thesauri are constructed in a standard form, as defined by ISO 2788 and various national standards.
– ISO 5964 extends discussion to multilingual issues
• Four basic relationships are fundamental in thesaurus construction and use...
– Equivalence (preferred and non-preferred terms)
– Hierarchy (‘glass’ is broader than ‘wine glass’)
– Association (establishes non-hierarchical relationships)
– Scope notes (provide guidance and clarification).
15
Equivalence
• As with the European Union example, there are often situations in which users or cataloguers wish to allow multiple synonyms for any one term.
– In these cases, one term may be defined as a preferred term
“Electricity Plant
USE Power Station”
– Here, ‘Power Station’ is the preferred term
Example from RCHME Thesaurus of Monument Types, © RCHME 1995.
16
Hierarchy
• An important capability of thesauri is their ability to reflect hierarchies, whether conceptual, spatial, or whatever.
– Individual thesaurus entries are linked to a class (CL), as well as to broader (BT) and narrower (NT) terms.
“BAYONET
CL Armour and Weapons
BT Edged Weapon
NT Plug Bayonet
NT Socket Bayonet”
Example from mda Archaeological Objects Thesaurus, © mda, English Heritage, RCHME 1997.
17
Association
• In any large thesaurus, a significant number of terms will mean similar things or cover related areas, without necessarily being synonyms or fitting into a defined hierarchy.
– Related Terms (RT) can be used to show these links within the thesaurus.
“CHURCH
RT Churchyard
RT Crypt
RT Presbytery”
Example from RCHME Thesaurus of Monument Types, © RCHME 1995.
18
Scope Notes
• Thesaurus entries can often be terse, and difficult to interpret for the non-expert.
– Scope Notes (SN) serve to clarify entries and avoid possible confusion. They serve to embody the underlying concept, rather than the language-specific word.
“CHITTING HOUSE
SN A building in which potatoes can sprout and germinate”
“FERRY
SN Includes associated structures”
Examples from RCHME Thesaurus of Monument Types, © RCHME 1995.
19
Putting it all together...
“FERROUS METAL EXTRACTION SITE
SN Includes preliminary processing
CL Industrial
BT Metal Industry Site
NT Ironstone Mine
NT Ironstone Pit
NT Ironstone Workings
RT Ironstone Workings”
Example from RCHME Thesaurus of Monument Types, © RCHME 1995.
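A complete thesaurus entry of this kind maps naturally onto a small data structure. The Python sketch below parses the tagged lines of the FERROUS METAL EXTRACTION SITE example into one object; the Entry class and parse_entry function are hypothetical, and the one-tag-per-line layout is an assumption about how such a record might be stored, not a description of any particular thesaurus software.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: str
    scope_note: str = ""                          # SN
    classes: list = field(default_factory=list)   # CL
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT


def parse_entry(text):
    """Parse a term followed by lines of the form 'TAG value'."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    entry = Entry(term=lines[0])
    targets = {"CL": entry.classes, "BT": entry.broader,
               "NT": entry.narrower, "RT": entry.related}
    for line in lines[1:]:
        tag, _, value = line.partition(" ")
        if tag == "SN":
            entry.scope_note = value
        elif tag in targets:
            targets[tag].append(value)
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    return entry


RECORD = """
FERROUS METAL EXTRACTION SITE
SN Includes preliminary processing
CL Industrial
BT Metal Industry Site
NT Ironstone Mine
NT Ironstone Pit
NT Ironstone Workings
RT Ironstone Workings
"""

entry = parse_entry(RECORD)
print(entry.term)
print("broader:", entry.broader)
print("narrower:", entry.narrower)
```

Keeping the four relationship types in separate fields makes it straightforward to follow BT links upward or NT links downward when browsing or expanding a search.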
20
Working with the tools
• Thesauri, controlled vocabulary lists, etc, are all useful, but they
– often rely upon both cataloguers and users having direct access to these usually weighty tomes
– require an awareness of cataloguing issues and practice to be used most effectively
– have predominantly developed within, rather than between, communities, regions, etc.
– rapidly become destabilised as distributed users add new terms in a non-complementary fashion.
21
Effective distributed thesauri [1]
• In order for thesauri to be effective in the online environment, research and good practice need to address:
– mapping between existing thesauri
– technical mapping
– semantic mapping
• are ‘E.E.C.’ and ‘Common Market’ synonymous?
– restructuring one or both where necessary/possible
– inter-disciplinary mapping
• the ‘God Problem’
– addressing legacy data
22
Effective distributed thesauri [2]
– delivery of training to remote cataloguers
– providing online access to more existing thesauri
– development of cataloguing tools
– capable of accessing various remote thesauri and selecting terms in an intuitive, timely fashion
• Nordic Metadata Project Dublin Core tool
– raising the profile of thesauri as “A Good Thing”!
– development of user interface tools
– capable of integrating various remote thesauri into the search process without slowing it intolerably, losing contextual awareness or subjecting the browser to information overload.
23
Some links
• English Heritage Thesauri
• www.rchme.gov.uk/thesaurus/thes_splash.htm
• Getty Thesauri
• www.getty.edu/gri/vocabularies/
• HASSET
• biron.essex.ac.uk/searching/zhasset.html
• High Level Thesaurus Project (HILT)
• hilt.cdlr.strath.ac.uk/
• Pan-Government Thesaurus
• Should be visible from www.govtalk.gov.uk/ eventually.
24
Metadata
25
What is ‘Metadata’?
– meaningless jargon
– or a fashionable, and terribly misused, term for what we’ve always done
– or “a means of turning data into information”
– and “data about data”
– and the name of a person (‘Tony Blair’)
– and the title of a book (‘The Name of the Rose’).
26
What is ‘Metadata’?
• Metadata exists for almost anything:
• People
• Places
• Objects
• Concepts
• Web pages
• Databases.
27
What is ‘Metadata’?
• Metadata fulfils three main functions:
• Description of resource content
– “What is it?”
• Description of resource form
– “How is it constructed?”
• Description of resource use
– “Can I afford it?”.
28
Challenges
Many flavours of metadata: which one do I use?
Managing change: new varieties, and evolution of existing forms
Tension between functionality and simplicity, extensibility and interoperability
Functions, features, and cool stuff versus simplicity and interoperability
Opportunities
29
Introducing the Dublin Core
• An attempt to improve resource discovery on the Web
– now adopted more broadly
• Building an interdisciplinary consensus about a core element set for resource discovery
– simple and intuitive
– cross-disciplinary (not just libraries!)
– international
– open and consensual
– flexible.
See purl.org/dc/
30
Introducing the Dublin Core
• 15 elements of descriptive metadata
• All elements optional
• All elements repeatable
• The whole is extensible
– offers a starting point for semantically richer descriptions.
31
Introducing the Dublin Core
• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
See purl.org/dc/
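To make "all elements optional, all elements repeatable" concrete, here is a minimal Python sketch that holds a Dublin Core description as a mapping from element name to a list of values and writes it out as XML. Several things here are assumptions rather than anything the slides specify: the wrapping <record> element and the to_xml function are hypothetical, the lower-case element names and the namespace URI follow the Dublin Core Metadata Element Set 1.1, and the sample values simply describe this presentation.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set 1.1 (not named in the slides).
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements, in the order listed on the slide.
ELEMENTS = ["title", "creator", "subject", "description", "publisher",
            "contributor", "date", "type", "format", "identifier",
            "source", "language", "relation", "coverage", "rights"]


def to_xml(record):
    """Serialise {element: [values]} to XML; any element may be absent or repeated."""
    unknown = set(record) - set(ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for name in ELEMENTS:
        for value in record.get(name, []):
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "title": ["Introducing some Standards"],
    "creator": ["Paul Miller"],
    "subject": ["Metadata", "Z39.50"],  # a repeated element
}
print(to_xml(record))
```

Only three of the fifteen elements are supplied, and one of them twice; both are legitimate under the rules above.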
32
Z39.50
33
What is Z39.50?
• ANSI/NISO Z39.50–1995, Information Retrieval (Z39.50): Application Service Definition and Protocol Specification
• ISO 23950:1998, Information and Documentation — Information Retrieval (Z39.50) — Application Service Definition and Protocol Specification.
See lcweb.loc.gov/z3950/agency/1995doce.html
34
What is Z39.50?
“This standard specifies a client/server based protocol for Information Retrieval. It specifies procedures and structures for a client to search a database provided by a server, retrieve database records identified by a search, scan a term list, and sort a result set. Access control, resource control, extended services, and a ‘help’ facility are also supported. The protocol addresses communication between corresponding information retrieval applications, the client and server (which may reside on different computers); it does not address interaction between the client and the end-user.”
(Z39.50–1995, page 0).
See lcweb.loc.gov/z3950/agency/1995doce.html
35
Some gory details…
• Z39.50 follows client/server model
• But calls them Origin and Target
Client/origin
Server/target
36
Client/Server architecture
37
Client/Server architecture
38
Some gory details…
• Z39.50–1995 is divided into eleven ‘Facilities’
– Initialization
– Search
– Retrieval
– Result-set-delete
– Browse
– Sort
– Access Control
– Accounting
– Explain
– Extended Services
– Termination.
See www.ariadne.ac.uk/issue21/z3950/
39
Facilities and Services
• Each Facility comprises at least one Service
• A Service facilitates a particular interaction between Origin and Target
• The three key services are Init, Search, and Present.
See www.ariadne.ac.uk/issue21/z3950/
40
Init
• The only Service of the Initialization Facility
• Origin–initiated
• Used to start a ‘Z-association’
• Origin requests a number of parameters under which the searches will be conducted
• Target responds, either accepting offered parameters or proposing others if necessary.
41
Search
• The only Service of the Search Facility
• Origin–initiated
• Used to actually conduct a search
• Origin specifies databases to be searched, attribute combinations, and query
• Target responds, identifying the number of matching results.
42
Present
• Main Service of the Retrieval Facility (along with Segment)
• Origin-initiated
• although Target can initiate a Segment request if the result set is very large
• Used to return records to the user.
43
Init for dummies
Origin: Hello. Do you speak English?
Target: Hello. Yes, I do. Let’s talk.
44
Search for dummies
Origin: Cool. Can I have anything you’ve got on a place called “London”?
Target: I’ve got 25 records matching your request, and here’s the first five. As you didn’t specify anything else, I’ve sent them to you in MARC, so I hope that’s OK.
45
Present for dummies
Origin: 25, eh? Can I have the first ten, please? Oh, and I really don’t like MARC. If you can send Dublin Core that would be great, and if not I’ll settle for some SUTRS.
Target: DC:Creator – blah
DC:Title – blah…
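The three-step conversation (Init, Search, Present) can be mimicked with a toy, in-memory Target in Python. This is only a sketch of the sequence of exchanges: it does not implement the real Z39.50 wire protocol, and the Target class, its option and syntax names, and the made-up records are all hypothetical. Unlike the Search dialogue, the toy Search returns only the hit count and no records.

```python
class Target:
    """Toy, in-memory stand-in for a Z39.50 Target (server)."""

    OPTIONS = {"search", "present"}
    SYNTAXES = ("DC", "SUTRS")

    def __init__(self, records):
        self.records = records   # list of dicts with 'creator', 'title', 'place'
        self.result_set = []

    def init(self, offered):
        """Init: agree on the options both sides support."""
        return set(offered) & self.OPTIONS

    def search(self, place):
        """Search: build a result set and report only the number of hits."""
        self.result_set = [r for r in self.records if r["place"] == place]
        return len(self.result_set)

    def present(self, start, count, preferred):
        """Present: return `count` records from 1-based position `start`,
        in the first syntax from `preferred` that this Target supports."""
        syntax = next((s for s in preferred if s in self.SYNTAXES), None)
        if syntax is None:
            raise ValueError("no mutually supported record syntax")
        chosen = self.result_set[start - 1:start - 1 + count]
        if syntax == "DC":
            return [f"DC:Creator - {r['creator']}; DC:Title - {r['title']}" for r in chosen]
        return [f"{r['creator']}: {r['title']}" for r in chosen]  # plain text, SUTRS-like


# 25 made-up records about London, plus one about Bath.
records = [{"creator": f"Author {i}", "title": f"Title {i}", "place": "London"}
           for i in range(1, 26)]
records.append({"creator": "Author 26", "title": "Title 26", "place": "Bath"})

target = Target(records)
agreed = target.init({"search", "present", "sort"})   # "Do you speak English?"
hits = target.search("London")                        # "anything on London?"
first_ten = target.present(1, 10, ["DC", "SUTRS"])    # "first ten, in Dublin Core please"
print(sorted(agreed), hits, len(first_ten))
print(first_ten[0])
```

The result set stays on the Target between Search and Present, which is why the Origin can ask for "the first ten" without repeating the query.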
46
Now it gets hairy…
• To communicate successfully, Origin and Target need to use the same Attribute Set.
• An Attribute Set like Bib-1 defines six forms of Attribute:
– Use
– Relation
– Truncation
– Completeness
– Position
– Structure.
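A search term is qualified by a combination of these attributes, each written as a (type, value) pair. The Python sketch below builds such a combination and prints it in a prefix notation of the kind used by some Z39.50 toolkits ("@attr type=value"); that notation is not mentioned in the slides. The type numbers and the example values (use 4 for title, relation 3 for equal, structure 1 for phrase, truncation 100 for do not truncate) are given from memory of the Bib-1 definitions and should be checked against the published attribute set before being relied upon. The function name to_pqf is hypothetical.

```python
# Bib-1 attribute type numbers (recalled, not taken from the slides; verify before use).
ATTRIBUTE_TYPES = {
    "use": 1,
    "relation": 2,
    "position": 3,
    "structure": 4,
    "truncation": 5,
    "completeness": 6,
}


def to_pqf(term, **attributes):
    """Render a term and its attribute values as a prefix-notation query string."""
    parts = [f"@attr {ATTRIBUTE_TYPES[name]}={value}"
             for name, value in attributes.items()]
    parts.append(f'"{term}"')
    return " ".join(parts)


# Assumed values: title, equal, phrase, do not truncate.
print(to_pqf("london", use=4, relation=3, structure=1, truncation=100))
```

If Origin and Target disagree about what any of these numbers mean, the same query returns different results on each system, which is why a shared Attribute Set matters.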
47
Use Attributes
• Define the ‘access points’ on which a search takes place
• Title, author, subject, etc.
See lcweb.loc.gov/z3950/agency/defns/bib1.html
48
Relation Attributes
• Define the relationship between the search term and values stored in the database/index
• Less than, greater than, equal to, phonetically matched, etc.
49
Truncation Attributes
• Define which part of the stored value is to be searched on
• Beginning of any word, end of any word, etc.
• Searching on the beginning of a word, ‘Smith’ finds ‘Smithsonian’ but not ‘Wordsmith’; searching on the end of a word, the reverse.
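The ‘Smith’ example can be reproduced with simple string matching. In this Python sketch the function truncation_match and the labels "right", "left" and "none" are hypothetical names chosen for illustration; real Targets apply truncation inside their indexes rather than word by word.

```python
def truncation_match(term, word, truncation):
    """Compare a search term with one indexed word under a truncation rule."""
    term, word = term.lower(), word.lower()
    if truncation == "right":   # term must be the beginning of the word
        return word.startswith(term)
    if truncation == "left":    # term must be the end of the word
        return word.endswith(term)
    if truncation == "none":    # no truncation: whole word must match
        return word == term
    raise ValueError(f"unknown truncation: {truncation}")


for rule in ("right", "left", "none"):
    found = [w for w in ("Smith", "Smithsonian", "Wordsmith")
             if truncation_match("Smith", w, rule)]
    print(rule, found)
```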
50
Completeness Attributes
• Define how much of the stored index term must be in the search term
• ‘Smith’ finds ‘Smith’, but not ‘Smithsonian’ or ‘the Smith’, etc.
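Completeness can be sketched in the same spirit. The Python function below is a deliberate simplification with a hypothetical name: with complete=True the search term must account for the whole stored value, and with complete=False it only needs to appear as whole words somewhere within it. The finer distinctions a real attribute set draws (for example between subfields and fields) are ignored here.

```python
def completeness_match(term, stored, complete):
    """Match a term against a stored value, either completely or as part of it."""
    term_words = term.lower().split()
    stored_words = stored.lower().split()
    if complete:
        # The term must be the entire stored value.
        return stored_words == term_words
    # Otherwise the term's words must appear as a consecutive run of whole words.
    n = len(term_words)
    return any(stored_words[i:i + n] == term_words
               for i in range(len(stored_words) - n + 1))


for stored in ("Smith", "Smithsonian", "the Smith"):
    print(stored, completeness_match("Smith", stored, complete=True))
```

With complete=True only the stored value ‘Smith’ matches, as in the example above; relaxing completeness would additionally find ‘the Smith’, while ‘Smithsonian’ would need truncation instead.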
51
Position Attributes
• Define where in the index the search term should be located
• At the start of the field, anywhere, etc.
52
Structure Attributes
• Specify the form to be searched for
• Word, phrase, date, etc.
53
Record Syntaxes
• Record Syntaxes define the structure in which results are returned to the Origin.
• This does not mean that Targets need to store data in these formats
• MARC
• UKMARC, USMARC/MARC21, DANMARC, MARB, UNIMARC…
• SUTRS
• Simple Unstructured Text Record Syntax
• GRS-1
• Generic Record Syntax
• XML.
54
Profiles
• Groupings of Attribute Sets, Record Syntaxes, etc. to meet specific needs
• Disciplinary
– Cultural Heritage (CIMI)
– Geospatial (GEO)
• Geographic/Cultural/National
– Texas Profile
– OPAC Network for Europe (ONE)
– Conference of European National Librarians (CENL)
• Functional
– Collections Profile
• Etc.
55
What’s wrong with Z39.50?
• Profiles for each discipline
• Defeats interoperability?
• Vendor interpretation of the standard
• Bib-1 bloat
• Largely invisible to the user
• Seen as complicated, expensive and old-fashioned
• Surely no match for XML/RDF/whatever.
56
Some joined-up working: The Bath Profile
• Vendors and systems implement areas of the Z39.50 standard differently
• Regional, national, and disciplinary Profiles have appeared over previous years, many of which have basic functions in common
• Users wish to search across national/regional boundaries, and between vendors.
See www.ariadne.ac.uk/issue21/z3950/
57
Learning from the past
• The Bath Profile is heavily influenced by
• ATS-1
• CENL
• DanZIG
• MODELS
• ONE
• Z Texas
• vCUC
See www.ukoln.ac.uk/interop-focus/bath/
58
Learning from the past
See www.ukoln.ac.uk/interop-focus/bath/
59
Doing the work
• ZIP-PIZ-L mailing list, hosted by National Library of Canada
• Meeting face-to-face
• JISC supported a face-to-face meeting in Bath (UK) over the summer of 1999
• A draft was widely circulated for comment
• ISO accreditation process
• Resulting in Internationally Registered Profile status
• Ongoing Maintenance Agency activity.
See www.ukoln.ac.uk/interop-focus/bath/
60
Doing the work
Makx Dekkers, PricewaterhouseCoopers/EC
Janifer Gatenby, GEAC
Juha Hakala, National Library of Finland
Poul Henrik Jørgensen, Danish Library Centre
Carrol Lunau, National Library of Canada
Paul Miller, UKOLN
Slavko Manojlovich, SIRSI/Memorial University of Newfoundland
Bill Moen, University of North Texas
Judith Pearce, National Library of Australia
Joe Zeeman, CGI.
See www.ukoln.ac.uk/interop-focus/bath/
61
What we proposed
• Minimisation of ‘defaults’
• Where possible, every attribute is defined in the Profile (Use, Relation, Position, Structure, Truncation, Completeness)
• Three Functional Areas
• Basic Bibliographic Search & Retrieval
• Bibliographic Holdings Search & Retrieval
• Cross-Domain Search & Retrieval
• Three Levels of Conformance in each Area.
See www.ukoln.ac.uk/interop-focus/bath/
62
What we proposed
• SUTRS or XML and UNIMARC or MARC21 for Bibliographic Search results
• SUTRS and Dublin Core (in XML) for Cross–Domain results
• Other record syntaxes also permitted, but conformant tools must support at least these.
See www.ukoln.ac.uk/interop-focus/bath/
63
Making it work…
• Adopted already by Texas, Atlantic Canada, CIC (Big 10), CENL, etc.
• Interoperability suite
• MARC21 in Texas
• UNIMARC and cross-domain in Europe?
• Direct approaches to international vendors
• User testing in Europe and North America
• Addition of Functional Areas and Levels of Conformance as required
• Community Information?
See www.ukoln.ac.uk/interop-focus/bath/
64
Standards…
• Technical standards make the job easier in the long run for users, curators, and managers
• but can make it harder to get started
• There is rarely a ‘right’ standard for all situations
• so identify a need to do something, without being specific about how
• know who your audience is, what you have to offer, and what your purpose/message is.