Murtha Baca

SLA Seattle, June 16, 2008. “Seek, and ye shall find:” Using Controlled Vocabularies to Enhance Access to Cultural Information

Transcript of Murtha Baca

Page 1: Murtha Baca

SLA Seattle

June 16, 2008

“Seek, and ye shall find:”

Using Controlled Vocabularies to Enhance Access

to Cultural Information

Page 2: Murtha Baca

Murtha Baca

Head, Vocabulary Program

Getty Research Institute

SLA Seattle

June 2008

Controlled Vocabularies: an Overview

Page 3: Murtha Baca

TYPOLOGY of DATA STANDARDS

Data structure standards (metadata element sets): MARC, EAD, Dublin Core, CDWA, VRA Core

Data content standards (cataloging rules): AACR (RDA), ISBD, CCO, DA:CS

Data value standards (vocabularies): LCSH, LCNAF, TGM, AAT, ULAN, TGN, MeSH

Data format standards (standards expressed in machine-readable form): MARC, MARCXML, EAD, CDWA Lite XML, Dublin Core Simple XML schema, DC Qualified XML schema, VRA Core XML schema

Page 4: Murtha Baca

What are vocabularies?

• Maps to guide people to information
– creating / filling
– searching / researching
– organizing / classifying / thinking

• Collections of terminology where relationships between terms are represented

• Data value standards (i.e. what is used to “fill” metadata elements/categories or “containers” of information)

Page 5: Murtha Baca

“Knowledge bases” -- bodies of knowledge represented by language (glossaries, dictionaries, thesauri, word lists)

What are vocabularies?

Page 6: Murtha Baca

Types of terms in vocabularies

• personal names: Collate, Charles B.
• geographic names: Campbeltown (Argyll and Bute, Scotland, UK)
• object names: clack valve
• corporate names: Cambrian Railways
• iconographic subjects and themes: The Legend of John Henry
• genre terms: political cartoons, fish stories
• multilingual equivalents: flat car (English) = Schienenwagen (German) = platforma (atklata) (Latvian)

Page 7: Murtha Baca

What is a controlled vocabulary?

A tool for consistency in the language used in the recording and retrieval of information

Page 8: Murtha Baca

What is a controlled vocabulary?

An organized arrangement of words and phrases that are used to index content and/or to retrieve content through navigation or a search

Typically a vocabulary that includes preferred terms and has a limited scope or describes a specific domain

Page 9: Murtha Baca

Types of Controlled Vocabularies

Page 10: Murtha Baca

Controlled Lists

Simple lists of terms used to control terminology

In a well-constructed controlled list:

• Each term must be unique (no homographs).
• Terms should all be members of the same class.
• Terms should not overlap in meaning.
• Terms should be equal in granularity or specificity.
• Terms are arranged alphabetically or in another logical order.

Page 11: Murtha Baca

Controlled Lists cont.

May include terms from other controlled vocabulary resources (especially standard published vocabularies)

For some elements or fields in a database, a controlled list may be sufficient to control terminology, particularly where the terminology for that field is limited and unlikely to have synonyms or ancillary information. (Example: artists’ roles in ULAN, place types in TGN).

Page 12: Murtha Baca

Controlled list: A simple list of terms used to control terminology

• manuscripts
• miscellaneous
• paintings
• photographs
• sculpture
• site installation
• texts
• vessels

Example of a controlled pick list for Classification

Patricia Harpring, 2008 © J. Paul Getty Trust
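The rules for a well-constructed controlled list can be checked mechanically. The following is a minimal sketch, assuming the Classification pick list shown on this slide; the function names `check_controlled_list` and `is_allowed` are hypothetical, and only uniqueness and alphabetical order are tested (class membership, overlap in meaning, and granularity still need human judgment).

```python
# Terms taken from the Classification pick list on the slide
# (lower-cased here for consistency; that normalization is an assumption).
CLASSIFICATION_PICK_LIST = [
    "manuscripts",
    "miscellaneous",
    "paintings",
    "photographs",
    "sculpture",
    "site installation",
    "texts",
    "vessels",
]


def check_controlled_list(terms):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    normalized = [term.strip().casefold() for term in terms]

    # Each term must be unique.
    seen = set()
    for term in normalized:
        if term in seen:
            problems.append(f"duplicate term: {term}")
        seen.add(term)

    # Terms are arranged alphabetically (one possible logical order).
    if normalized != sorted(normalized):
        problems.append("terms are not in alphabetical order")

    return problems


def is_allowed(value, terms=CLASSIFICATION_PICK_LIST):
    """True if the value is one of the terms in the controlled list."""
    return value.strip().casefold() in {term.casefold() for term in terms}
```

A cataloging form could call `is_allowed` before saving a record, so that only values from the pick list ever reach the Classification field.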

Page 13: Murtha Baca

A list comprising sets of terms that are considered equivalent

• No preferred term
• Generally used for search and retrieval, providing access to content that is represented in natural, uncontrolled language

Felis domesticus

Synonym ring list

Jean-Baptiste Perroneau, Portrait of Magdaleine Pinceloup, © J. Paul Getty Museum; Chat Noir, Theophile-Alexandre Steinlen, © Sta. Barbara Museum of Art. Egyptian Cat, © Metropolitan Museum. Cat and Kittens, © National Gallery of Art. Maneki Neko, Japanese, © private collection.

© J. Paul Getty Trust; Patricia Harpring 2008

domestic cat

cat

Felis catus

house cat
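A synonym ring can be sketched as a set of equivalent terms with no preferred member, used to expand a searcher's query. This is only an illustration built from the cat terms on this slide; `SYNONYM_RINGS` and `expand_query` are hypothetical names, not part of any Getty tool.

```python
# One synonym ring: every member is equivalent, none is preferred.
SYNONYM_RINGS = [
    {"cat", "domestic cat", "house cat", "Felis catus", "Felis domesticus"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return every term in the ring that contains `term`, or just the term itself."""
    key = term.strip().casefold()
    for ring in rings:
        if key in {member.casefold() for member in ring}:
            # Sorted only to make the output stable; the order carries no meaning.
            return sorted(ring, key=str.casefold)
    return [term]
```

A search system would then OR the expanded terms together, so that a query for "house cat" also reaches content described only as "Felis catus".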

Page 14: Murtha Baca

Compilations, usually in alphabetical order, that combine separate concepts into a “string,” as in the Library of Congress Subject Headings (LCSH)

Commercial fishing -- Japanese competition

Salmon fisheries -- law and legislation -- California

Subject Headings

Page 15: Murtha Baca

Pre-coordination of terminology is a characteristic of subject headings; subject headings typically combine several unique concepts together.

Subject Headings cont.

Subject headings--Pictures.

Pictures--Computer network resources.

World Wide Web--Subject access.

Page 16: Murtha Baca

Taxonomies/Classifications

Vocabularies that organize a body of knowledge for a defined domain into conceptual categories, e.g. Nomenclature for Museum Cataloging, ICONCLASS.

The Greek heroic legends
  Story of Hercules (Heracles)
    Labors of Hercules
      Hercules chokes the Nemean lion
      Hercules kills the Hydra of Lerna
      Hercules captures the Ceryneian hind
      Hercules captures the Cretan bull

http://www.iconclass.nl/

Page 17: Murtha Baca

Compilations of terms representing single concepts. Thesauri explicitly express relationships among terms via a semantic structure.

<visual works by form>
  dioramas
  diptychs
  medals
  medallions (medals)
  polyptychs
  triptychs

Thesauri

Page 18: Murtha Baca

Authority Files

• Compilations of authorized terms or headings used by a single information system, organization, or consortium for cataloging, indexing, and documentation.

• Main purpose is to regulate usage.

• Include synonyms (“See” references) and related or associated terms (“See also” references).

• Examples include Library of Congress Name Authority File (LCNAF), local authorities for names, subjects, etc.

• Authority files may take the form of thesauri, word lists, etc.; in other words, any kind of vocabulary can be used as an authority.

Page 19: Murtha Baca

More on Thesauri

Page 20: Murtha Baca

Thesauri

Terms in a thesaurus may have the following three types of relationships:

• Equivalence
• Hierarchical
• Associative

Page 21: Murtha Baca

Thesaural Relationships

• Equivalence
– synonyms, spelling variations, language variations

• Hierarchical
– broader to narrower
  • whole/part
  • genus/species

• Associative
– related concepts

Page 22: Murtha Baca

Equivalence Relationship: Terms/names denote the same thing; a preferred name is used for displays

Bulgarini, Bartolomeo (Sienese painter, circa 1337-1378)
Lorenzetti, Ugolino
Master of the Ovile Madonna
Ovile Master

example from ULAN

Page 23: Murtha Baca

Equivalence Relationship

still lifes
still life
still-lifes
still lives
nature morte
natura morta
stilleven
Stilleben
vie coye
ontbijtje
banketje
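An equivalence relationship can be sketched as a lookup from every variant to one preferred term. The example below assumes that "still lifes", the first term listed on this slide, is the preferred form; `build_lookup` and `preferred_term` are hypothetical helper names.

```python
# Assumption: the first term on the slide is the preferred one.
PREFERRED = "still lifes"
VARIANTS = [
    "still life",
    "still-lifes",
    "still lives",
    "nature morte",
    "natura morta",
    "stilleven",
    "Stilleben",
    "vie coye",
    "ontbijtje",
    "banketje",
]


def build_lookup(preferred, variants):
    """Map the preferred term and each variant (case-insensitively) to the preferred term."""
    lookup = {preferred.casefold(): preferred}
    for variant in variants:
        lookup[variant.casefold()] = preferred
    return lookup


def preferred_term(term, lookup):
    """Return the preferred term for any known variant, or None if the term is unknown."""
    return lookup.get(term.strip().casefold())


STILL_LIFE_LOOKUP = build_lookup(PREFERRED, VARIANTS)
```

With this lookup, an indexer who types "nature morte" and a searcher who types "Stilleben" both arrive at the same display term.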

Page 24: Murtha Baca

Whole/Part Relationship: “children” or narrower terms are part of the parent or broader term

España (nation)
  Andalucía (region)
    Almería (province)
    Cádiz (province)
    Córdoba (province)
    Granada (province)
    Huelva (province)
    Málaga (province)
    Sevilla (province)
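A whole/part hierarchy like this one can be sketched as a mapping from each place to its broader (parent) place. The structure and the names `BROADER`, `broader_path`, and `narrower_terms` are illustrative assumptions, not the actual TGN data model.

```python
# Each key is a place; each value is its broader (parent) place.
BROADER = {
    "Andalucía": "España",
    "Almería": "Andalucía",
    "Cádiz": "Andalucía",
    "Córdoba": "Andalucía",
    "Granada": "Andalucía",
    "Huelva": "Andalucía",
    "Málaga": "Andalucía",
    "Sevilla": "Andalucía",
}


def broader_path(name):
    """Walk upward from a place to the top of the hierarchy."""
    path = [name]
    while path[-1] in BROADER:
        path.append(BROADER[path[-1]])
    return path


def narrower_terms(name):
    """Return the immediate children of a place, in the order they were entered."""
    return [child for child, parent in BROADER.items() if parent == name]
```

A search for works from "Andalucía" could then be broadened automatically to records indexed only under one of its provinces.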

Page 25: Murtha Baca

Genus/Species Relationship: “children” represent types of the “parent” or broader term

funerary sculpture
  brasses
  effigies
  gisants
  ...
  haniwa
  tomb slabs
  ushabti

Page 26: Murtha Baca

Associative Relationship: terms are related conceptually, but not necessarily hierarchically

Descriptor: charterhouses
Hierarchy: Built Complexes and Districts
Scope note: Carthusian monasteries.
Alternate Forms of Speech {ALT}: charterhouse
Synonyms and spelling variants {UF}: certose, charter houses, chartreuses
Related concepts: Carthusian (Religions hierarchy)
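The charterhouses record shows all three thesaural relationships in one place. As a rough sketch, such a record might be modeled as below; the `ThesaurusRecord` class and its field names are assumptions made for illustration and do not reflect the real AAT schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThesaurusRecord:
    descriptor: str                                            # preferred term
    hierarchy: str                                             # hierarchical placement
    scope_note: str = ""
    alternate_forms: List[str] = field(default_factory=list)  # ALT
    used_for: List[str] = field(default_factory=list)         # UF (equivalence)
    related: List[str] = field(default_factory=list)          # associative links

    def entry_terms(self):
        """All terms that should lead a searcher to this record."""
        return [self.descriptor, *self.alternate_forms, *self.used_for]

    def matches(self, term):
        """True if `term` is the descriptor or one of its equivalents."""
        wanted = term.strip().casefold()
        return any(wanted == entry.casefold() for entry in self.entry_terms())


charterhouses = ThesaurusRecord(
    descriptor="charterhouses",
    hierarchy="Built Complexes and Districts",
    scope_note="Carthusian monasteries.",
    alternate_forms=["charterhouse"],
    used_for=["certose", "charter houses", "chartreuses"],
    related=["Carthusian"],
)
```

Note that the related concept "Carthusian" does not count as a match for the record itself; it only points the user toward a conceptually connected record elsewhere in the thesaurus.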

Page 27: Murtha Baca

indexer thesaurus: A thesaurus designed to control terminology and guide indexers in the choice of terms. See also end-user thesaurus.

indexing: Also called human indexing. The process of evaluating information and designating indexing terms by using controlled vocabulary that will aid in finding and accessing the cultural work record. Refers to indexing done by human labor, not to the automatic parsing of data into a database index, which is used by a system to speed up search and retrieval.

indexer thesaurus

Page 28: Murtha Baca

A thesaurus designed for direct access by searchers rather than for use by indexers. Instead of controlling the terminology, the purpose of an end-user thesaurus is to help searchers find useful terminology for improving, narrowing, and broadening their queries.

end-user thesaurus

Page 29: Murtha Baca

A vocabulary constructed with the goal of being interoperable with an existing vocabulary, e.g. a specialty vocabulary such as a conservation thesaurus that is intended to be linked to the superstructure of a larger vocabulary, such as the AAT.

satellite vocabulary

Page 30: Murtha Baca

Vocabularies provide intellectual “paths” that can improve access to information

Harlem Renaissance

Negro Renaissance

New Negro Movement

Renaissance, Harlem

Renaissance, Negro

Jacob Lawrence, Tombstones, 1942

Example from the AAT

Page 31: Murtha Baca

Why do we need vocabularies?

• Because of national and regional differences: lorries vs. trucks, lifts vs. elevators, Tom Thumb golf courses vs. miniature golf courses

• Because of historical vs. contemporary names: Iran vs. Persia vs. Islamic Republic of Iran

• Because of political and social changes: KhoiKhoi vs. Hottentot

• Because of linguistic differences: Titian vs. Tiziano vs. Titien; pottery vs. keramik vs. céramique

• To disambiguate homographs: sinopia (pigment -- Materials hierarchy) vs. sinopia (preliminary drawing -- Visual Works hierarchy)

Page 32: Murtha Baca

Why do we need vocabularies?

Thesaural relationships provide greater research/searching capabilities:

drawings
  <drawings by function>
    preliminary drawings
      underdrawings
        sinopie

Page 33: Murtha Baca

Issues in vocabulary-enhanced searching

• User interfaces are problematic

• Optimally, controlled vocabularies should be used both on the “back end” and on the “front end” to be most effective

• Economics: consistent implementation of controlled vocabularies is time- and labor-intensive

• Vocabulary control is almost non-existent on the Web at present

Page 34: Murtha Baca

Search “ARES” Against Getty Web site

Page 35: Murtha Baca

“ARES” did not match any pages

Page 36: Murtha Baca

Improve recall by ORing equivalent names (Ares, Mars)

Page 37: Murtha Baca

“Ares OR Mars” now retrieves 37 pages

Page 38: Murtha Baca

Search “ARES” Against Google (returns 1,250,000 pages; none of the first 6 pages are relevant)

Page 39: Murtha Baca

Increase precision by ANDing the broader/parent term of ARES, “Major Gods”

Page 40: Murtha Baca

“Ares AND Major Gods” now narrows to 506 hits (all of the first 7 pages are relevant)

Page 41: Murtha Baca

Recall and Precision

Searching “Ares” against the Getty site retrieves nothing, so we need to include synonyms/equivalents (OR “Mars”) to improve recall. The same search against Google, however, returns too many hits, so we need to add the broader term (AND “Major Gods”) to improve precision. This illustrates how important it is for a retrieval system to be flexible and let the user decide how to refine the search according to the specific situation.
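The two refinements can be sketched with a toy search over a handful of made-up documents. Everything here is hypothetical: the `DOCS` collection is invented for illustration, and the naive substring matching stands in for whatever a real search engine does.

```python
# Invented sample documents; not real Getty or Google results.
DOCS = {
    1: "Statue of Mars, Roman god of war",
    2: "Ares and Aphrodite on an Attic vase; Major Gods of the Greek pantheon",
    3: "ARES emergency radio service newsletter",
    4: "Venus and Mars, oil on panel",
}


def contains(text, term):
    """Naive case-insensitive substring match."""
    return term.casefold() in text.casefold()


def search_any(docs, terms):
    """OR search: a document matches if it contains at least one term (helps recall)."""
    return [doc_id for doc_id, text in docs.items()
            if any(contains(text, term) for term in terms)]


def search_all(docs, terms):
    """AND search: a document matches only if it contains every term (helps precision)."""
    return [doc_id for doc_id, text in docs.items()
            if all(contains(text, term) for term in terms)]
```

In this toy collection, ORing the equivalent name "Mars" brings in documents 1 and 4, which a search for "Ares" alone misses, while ANDing the broader term "Major Gods" drops the irrelevant radio newsletter.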

Page 42: Murtha Baca

Examples of standards for data values:

• The Getty Vocabularies
• Library of Congress Name Authority File (LCNAF)
• Library of Congress Subject Headings (LCSH)
• ICONCLASS

Page 43: Murtha Baca

The Getty Vocabularies

Page 44: Murtha Baca

The Getty Vocabularies

Compiled and maintained by the Getty Vocabulary Program

Union List of Artist Names® (ULAN) 117,600 ‘records’; 257,241 names

Art & Architecture Thesaurus® (AAT) 33,150 ‘records’; 128,075 terms

Getty Thesaurus of Geographic Names® (TGN) 911,300 ‘records’; 1,102,200 names

• Focus on the visual arts, architecture, & material culture
• Are compiled resources (not comprehensive)
• Grow through contributions
• May be licensed (vendors of collection management systems, others)

http://www.getty.edu/research/conducting_research/vocabularies/

Page 45: Murtha Baca

Controlled vocabularies: Why bother?

Page 46: Murtha Baca
Page 47: Murtha Baca
Page 48: Murtha Baca
Page 49: Murtha Baca

Αγία Σοφία

Ayasofya

Church of the Holy Wisdom

Hagia Sophia

Haghia Sophia

Saint Sophia

Sancta Sophia

St. Sophia

Page 50: Murtha Baca

Constantinople

Constantinopolis

Costantinopoli

Estambul

Istanbul

Konstantinopel

New Rome

Mikligard

Tsargrad

Tsarigrad

names from Getty Thesaurus of Geographic Names (TGN)

Page 51: Murtha Baca

deposit slip/deposit ticket = paying-in slip

confirmation chit = receipt, deposit receipt

Page 52: Murtha Baca

= cargo shorts

= board shorts

Page 53: Murtha Baca
Page 54: Murtha Baca
Page 55: Murtha Baca

desk?

cartonnier?

chest?

cabinet?

Page 56: Murtha Baca

dolls?

figurines?

statuettes?

idols?

carvings?

sculptures?

Page 57: Murtha Baca

Giambologna?

Giovanni da Bologna?

Jean de Boulogne?

• Users may call the same artist by various names

• Items have been cataloged using different names for the same artist

Page 58: Murtha Baca

• published misspellings provide access points

NAMES:
O’Keeffe, Georgia
Georgia O’Keeffe
O’Keefe, Georgia
Stieglitz, Alfred, Mrs.

Georgia O'Keefe, Ram's Skull With Brown Leaves, Roswell Museum and Art Center, Roswell, New Mexico; from: http://www.roswellmuseum.org/

Common misspellings

Page 59: Murtha Baca

Anonymous artist, later named

• former appellations
• name is now known

NAMES:
Bulgarini, Bartolomeo
Bartolomeo Bolgarini
Bulgarini da Siena, Bartolommeo
Lorenzetti, Ugolino
Master of the Ovile Madonna
Ovile Master

The Crucifixion, mid 1300s, tempera on wood, The Hermitage (St. Petersburg, Russia)

image from http://sunserv.kfki.hu/~arthp/html/l/lorenzet/ugolino/index.html

Page 60: Murtha Baca

Database issues

• repeating vs. non-repeating fields

• vocabulary-controlled vs. free-text fields (for indexing vs. display)

• “built-in thesauri”; vocabulary-assisted searching OR

• addition of broader terms, variants, at record level

Page 61: Murtha Baca

If we use terms from a standard source such as LCSH or the AAT, why do we need our own “local” authority file(s)?

Page 62: Murtha Baca

Why do we need local authorities?

Local authorities can provide terms not found in published authorities, including non-expert and even “wrong” terms and names. An authority record can remind the cataloger/indexer/abstractor of policies regarding local usage of the term. An authority record can contain relevant/appropriate variant names for the term and identify the one that is preferred and used by the project or institution.
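A local authority file of this kind might be sketched as follows. The record reuses the Georgia O’Keeffe names shown earlier in the deck, including the published misspelling; the class names, the `resolve` method, and the wording of the usage note are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


def normalize(name):
    """Fold case and treat typographic and straight apostrophes alike."""
    return name.replace("\u2019", "'").strip().casefold()


@dataclass
class AuthorityRecord:
    preferred: str
    variants: List[str] = field(default_factory=list)  # may include "wrong" forms
    usage_note: str = ""                                # local policy reminder


class LocalAuthorityFile:
    def __init__(self):
        self._index = {}

    def add(self, record):
        """Index the record under its preferred name and every variant."""
        for name in [record.preferred, *record.variants]:
            self._index[normalize(name)] = record

    def resolve(self, name):
        """Return the locally preferred name for any known form, else None."""
        record = self._index.get(normalize(name))
        return record.preferred if record else None


authority = LocalAuthorityFile()
authority.add(AuthorityRecord(
    preferred="O'Keeffe, Georgia",
    variants=["Georgia O'Keeffe", "O'Keefe, Georgia", "Stieglitz, Alfred, Mrs."],
    usage_note="Hypothetical local policy: always display the inverted form.",
))
```

Keeping the misspelled variant in the record means that a searcher who types "O'Keefe" still reaches the artist, while catalogers continue to see and use the preferred form.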

Page 63: Murtha Baca

What about social tagging and folksonomies?

Page 64: Murtha Baca

In the context of the Web, the act of associating terms (called “tags”) with an information object (e.g. a Web page, an image, a streaming video clip), thus describing the item and enabling keyword-based classification and retrieval. Tags – a form of user-generated metadata – from communities of users can be aggregated and analyzed, providing useful information about the collection of objects with which the tags have been associated.

tagging
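Aggregating tags from a community of users can be sketched in a few lines. The tagging events below are invented for illustration, and `aggregate_tags` is a hypothetical helper; real tagging systems would also record who applied each tag and when.

```python
from collections import Counter, defaultdict


def aggregate_tags(events):
    """Count how often each tag was applied to each information object."""
    counts = defaultdict(Counter)
    for object_id, tag in events:
        # Fold case so that "Cat" and "cat" are counted together.
        counts[object_id][tag.strip().casefold()] += 1
    return counts


# Invented (object, tag) pairs from several imaginary users.
EVENTS = [
    ("image-1", "cat"),
    ("image-1", "Cat"),
    ("image-1", "kitten"),
    ("image-2", "still life"),
]

tag_counts = aggregate_tags(EVENTS)
```

The most frequent tags for an object give a rough picture of how users describe it, which can then be compared with the terms a controlled vocabulary would assign.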

Page 65: Murtha Baca

The decentralized practice and method by which individuals and groups create, manage, and share terms, names, etc. (called “tags”) to annotate and categorize digital resources in an online “social” environment. A folksonomy is the result of social tagging. Also referred to as collaborative tagging, social classification, social indexing, mob indexing, folk categorization.

social tagging

Page 66: Murtha Baca

An orderly classification that explicitly expresses the relationships, usually hierarchical (e.g., genus/species, whole/part, class/instance), between and among the things being classified.

taxonomy

Page 67: Murtha Baca

An assemblage of concepts, represented by terms and names (called “tags”), the result of social tagging. A folksonomy is not a taxonomy.

folksonomy

Page 68: Murtha Baca

Vocabularies in the Corporate World

Page 69: Murtha Baca

Disney Titles (preferred forms)

• One Hundred and One Dalmatians
• 101 Dalmatians
• 101 Dalmatians II: Patch’s London Adventure
• 101 Dalmatians, Disney’s
• 101 Dalmatians: Escape from De Vil Manor
• Sing Along Songs: Disney’s: 101 Dalmatians – Pongo & Perdita
• 101 Dalmatians Holiday Art

Page 70: Murtha Baca

Disney Variants

One Hundred and One Dalmatians
One Hundred and One Dalmations
One Hundred and One Dalmatians (animated)
101 Dalmations (animated feature film)
101 Dalmations

101 Dalmatians
One Hundred and One Dalmatians (live action)
One Hundred and One Dalmations (live action)
One Hundred and One Dalmations (live-action feature)
101 Dalmations

Page 71: Murtha Baca

What’s in a name? That which we call a rose

By any other name would smell as sweet.

Shakespeare, Romeo and Juliet, Act II, scene ii

Page 72: Murtha Baca

Murtha Baca

Head, Vocabulary Program

Getty Research Institute

[email protected]

http://getty.edu/research/conducting_research/vocabularies/