10/26/2000 Information Organization and Retrieval

Metadata and Description

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Review

• Organization of Information

• Information Life Cycle

Course Schedule

• Organization
– Overview
– Metadata and Markup
– Controlled Vocabularies, Classification, Thesauri
– Information Design
    • Thesaurus Design
    • Database Design

• Retrieval
– The Search Process
– Content Analysis
    • Tokenization, Zipf’s Law, Lexical Associations
– IR Implementation
– Term weighting and document ranking
    • Vector space model
    • Probabilistic model
– User Interfaces
    • Overviews, query specification, providing context, relevance feedback

Why Organize Information?

• The main reason
– So that you can find things more effectively

• I.e., effective retrieval is predicated on some sort of organization applied to information resources

• Historically there have been many institutions and tools devoted to information organization
– Libraries
– Museums
– Archives
– Indexes and catalogs, dictionaries, phone books, etc.

Information Life Cycle

[Figure: information life cycle diagram. Labels: Creation, Utilization, Searching; Active, Semi-Active, Inactive; Retention/Mining; Disposition; Discard; Using, Creating; Authoring, Modifying; Organizing, Indexing; Storing, Retrieval; Distribution, Networking; Accessing, Filtering.]

Key issues in this course

• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them.
– Organizing

• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs.
– Retrieving

Key Issues

[Figure: the information life cycle diagram again. Labels: Creation, Utilization, Searching; Active, Semi-Active, Inactive; Retention/Mining; Disposition; Discard; Using, Creating; Authoring, Modifying; Organizing, Indexing; Storing, Retrieval; Distribution, Networking; Accessing, Filtering.]

Structure of an IR System

[Figure: structure of an information storage and retrieval system. Labels: Interest profiles & Queries; Documents & data; Formulating query in terms of descriptors; Indexing (Descriptive and Subject); Storage of profiles; Storage of Documents; Store 1: Profiles/Search requests; Store 2: Document representations; Storage Line; Comparison/Matching; Potentially Relevant Documents; Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]

Metadata

• Metadata is:
– “data about data” (the term as used in database systems)
– Information about Information
– Structures and Languages for the Description of Information Resources and their elements (components or features)
– “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)

Types of Metadata

• Element names.

• Element description.

• Element representation.

• Element coding.

• Element semantics.

• Element classification.

Today

• Bibliographic Metadata (traditional Library cataloging)

• Other Metadata systems

• Dublin Core

How can you describe an information-bearing object?

Bibliographic Information

• Describes documents

• What is a document (revisited)?

• Choice of descriptive elements and content of those elements typically governed by a set of rules:
– AACR II

• Elements coded in standard ways for transmission.
– MARC

Goals of Descriptive Cataloging

1. To enable a person to find a document of which
– the author, or
– the title, or
– the subject is known

2. To show what a library has
– by a given author
– on a given subject (and related subjects)
– in a given kind (or form) of literature.

3. To assist in the choice of a document
– as to its edition (bibliographically)
– as to its character (literary or topical)

Charles A. Cutter, 1876

Rules for Descriptive Cataloging

• ISBD

• AACR

• AACR II

AACR II

• Sources of Information

• ISBD areas

• Choice of Access Points

Sources of Information

• Each different type of material has a preferred location for deriving information about it.
– Books and printed material
    • Title page
– Cartographic Materials (Maps, globes, etc.)
    • The map itself, or containers, stands, etc.
– Sound recordings
    • Disc label, cassette label, etc.

ISBD Areas

• Title and Statement of Responsibility
• Edition
• Material or type of publication specification
• Publication, Distribution (etc.)
• Physical Description
• Series
• Notes
• Standard Numbers

ISBD Punctuation

• Title Proper (GMD) = Parallel title : other title info / First statement of responsibility ; others. -- Edition information. -- Material. -- Place of Publication : Publisher Name, Date. -- Material designation and extent ; Dimensions of item. -- (Title of Series / Statement of responsibility). -- Notes. -- Standard numbers: terms of availability (qualifications).

Bibliographic Record

• Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992. -- (Library science text series).
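
The record above can be assembled mechanically from its parts using the ISBD punctuation pattern. The following is a minimal sketch, not a cataloging tool: the function name `isbd_record` and its parameters are hypothetical, and it covers only the areas that appear in this one record (title and responsibility, edition, publication, series).

```python
def isbd_record(title, responsibility, edition=None, edition_responsibility=None,
                place=None, publisher=None, date=None, series=None):
    """Join bibliographic elements with ISBD-style punctuation (partial sketch)."""
    areas = []

    # Title and statement of responsibility: "Title / Responsibility"
    areas.append(f"{title} / {responsibility}")

    # Edition area: "Edition / Responsibility relating to the edition"
    if edition:
        area = edition
        if edition_responsibility:
            area += f" / {edition_responsibility}"
        areas.append(area)

    # Publication area: "Place : Publisher, Date"
    if place and publisher and date:
        areas.append(f"{place} : {publisher}, {date}")

    # Series area is enclosed in parentheses
    if series:
        areas.append(f"({series})")

    # Each area ends with a full stop (not doubled if one is already there),
    # and areas are separated by " -- ".
    closed = [a if a.endswith(".") else a + "." for a in areas]
    return " -- ".join(closed)


record = isbd_record(
    title="Introduction to cataloging and classification",
    responsibility="Bohdan S. Wynar",
    edition="8th ed.",
    edition_responsibility="Arlene G. Taylor",
    place="Englewood, Colo.",
    publisher="Libraries Unlimited",
    date="1992",
    series="Library science text series",
)
print(record)
```

The printed string matches the bibliographic record shown on the slide, which is a useful check that the punctuation between and within areas has been applied in the right places.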

Choice of Access Points

• Title(s) (Always main title)

• Main Entry??

• Added Entries

• Series Titles

• Identifying Numbers

More Metadata Systems

• The following is a sample of metadata systems for a variety of special types of data, documents, and objects.

Types of Metadata Systems and Standards

• Naming and ID systems

• Bibliographic description
– Texts

• Music

• Images and objects

• Numeric Data

• Geospatial Data

• Collections

Naming and ID Systems

• URLs (Uniform Resource Locators)

– URIs (Uniform Resource Identifiers)

• URNs (Uniform Resource Names)

• URCs (Uniform Resource Characteristics)

• Kahn/Wilensky Handles

• SICI (Serial Item and Contribution Identifier)

• ISBN

• ISSN

Bibliographic Description

• MARC (Machine Readable Cataloging)

• Dublin Core

– Warwick Framework for Dublin Core Metadata

• GILS (Government Information Locator Service)

• RFC 1807 (Format for Bibliographic Records)

• RDF (Resource Description Framework)

More Bibliographic Descriptors

• TEI Headers (Text Encoding Initiative)

• BibTeX

• PICS (Platform for Internet Content Selection)

• SOIF (Summary Object Interchange Format)

Music

• Standard Music Description Language (SMDL)

Numeric Data

• ICPSR Data Documentation Initiative (SGML DTD development)

• Standard for Survey Design and Statistical Methodology Metadata (SDSM)

Images and Objects

• Categories for the Description of Works of Art (Getty Art Institute)

• Consortium for the Computer Interchange of Museum Information (CIMI)

• RLG REACH Element Set (for Shared Description of Museum Objects)

• VRA Core Categories (Visual Resources Association)

Geospatial Data

• Content Standards for Digital Geospatial Metadata

• FGDC (Federal Geographic Data Committee)

• ASTM Section D18.01.05 Draft Specification: Content Specification for Digital Geospatial Metadata (American Society for Testing and Materials, ASTM).

Collection Level Descriptors

• EAD (Encoded Archival Description)

• Z39.50 Profile for Access to Digital Collections

• RSLP Collection Description (Research Support Libraries Programme)

Dublin Core

• Simple metadata for describing internet resources.

• For “Document-Like Objects”

• 15 Elements.

Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management

Title

• Label: TITLE

• The name given to the resource by the CREATOR or PUBLISHER.

Author or Creator

• Label: CREATOR

• The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

Subject and Keywords

• Label: SUBJECT

• The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

Description

• Label: DESCRIPTION

• A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

Publisher

• Label: PUBLISHER

• The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

Other Contributors

• Label: CONTRIBUTORS

• Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).

Date

• Label: DATE

• The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schema are possible, but if used, they should be identified in an unambiguous manner.
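
As a small illustration of the YYYYMMDD form described above, the sketch below converts between a calendar date and the 8 digit string using Python's standard `datetime` module. The helper names `dc_date` and `parse_dc_date` are made up for this example and are not part of any Dublin Core specification.

```python
from datetime import date, datetime


def dc_date(d: date) -> str:
    """Format a date as an 8 digit YYYYMMDD string."""
    return d.strftime("%Y%m%d")


def parse_dc_date(s: str) -> date:
    """Parse an 8 digit YYYYMMDD string; raises ValueError if it is not a real date."""
    return datetime.strptime(s, "%Y%m%d").date()


# The example from the slide: December 3, 1996
print(dc_date(date(1996, 12, 3)))
print(parse_dc_date("19961203"))
```

Parsing with a strict format string also rejects impossible dates such as 19961340, which is one practical reason to prefer a fixed, identified scheme over free-text dates.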

Resource Type

• Label: TYPE

• The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. One preliminary set of such types can be found at the following URL (now out of date): http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html

Format

• Label: FORMAT

• The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.

Resource Identifier

• Label: IDENTIFIER

• String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.
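
One reason identifiers such as ISBNs work well in this element is that they can be checked mechanically. The sketch below applies the standard ISBN-10 check-digit rule (the digits weighted 10 down to 1 must sum to a multiple of 11, with X standing for 10 in the last position) to the two ISBNs that appear in the sample records later in these slides. The function name is hypothetical.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if the string passes the ISBN-10 check-digit test."""
    chars = isbn.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        return False

    total = 0
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            value = int(ch)
        elif ch in "Xx" and position == 9:
            value = 10  # "X" is only allowed as the final check character
        else:
            return False
        # Weights run from 10 (first character) down to 1 (last character)
        total += (10 - position) * value

    return total % 11 == 0


print(is_valid_isbn10("0872879674"))  # paper edition in the sample records
print(is_valid_isbn10("0872878112"))  # cloth edition in the sample records
```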

Source

• Label: SOURCE

• The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

Language

• Label: LANGUAGE

• Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html

Relation

• Label: RELATION

• Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

Coverage

• Label: COVERAGE

• The spatial locations and temporal duration characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

Rights Management

• Label: RIGHTS

• The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.

The Same Item in Different Metadata Systems

• ISBD

• Dublin Core

• RFC 1807

• TEI Header

• MARC Record

ISBD Punctuation

• Title Proper (GMD) = Parallel title : other title info / First statement of responsibility ; others. -- Edition information. -- Material. -- Place of Publication : Publisher Name, Date. -- Material designation and extent ; Dimensions of item. -- (Title of Series / Statement of responsibility). -- Notes. -- Standard numbers: terms of availability (qualifications).

Bibliographic Record

• Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992. -- (Library science text series).

Dublin Core

• TITLE: Introduction to cataloging and classification
• CREATOR: Taylor, Arlene G.
• OTHER CONTRIBUTOR: Wynar, Bohdan S.
• DATE: 1992
• FORMAT: BOOK
• LANGUAGE: ENG
• PAGES: 633
• PUBLISHER: Libraries Unlimited
• SUBJECT: Cataloging.
• SUBJECT: subject cataloging.
• SUBJECT: Classification -- Books
• DESCRIPTION: Textbook on cataloging and classification
• RESOURCE TYPE: text.monograph
• RESOURCE IDENTIFIER: (ISBN) 0872879674
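
A record like this can be held as a list of (name, value) pairs, which keeps repeated elements such as SUBJECT in order. The sketch below checks the sample against the fifteen labels given on the element slides. The alias table and the function name `check_record` are assumptions made for this example: the sample spells out some element names (OTHER CONTRIBUTOR, RESOURCE TYPE, RESOURCE IDENTIFIER) where the element slides give shorter labels, so the sketch maps one to the other.

```python
# The fifteen labels as listed on the Dublin Core element slides.
DC_LABELS = {
    "TITLE", "CREATOR", "SUBJECT", "DESCRIPTION", "PUBLISHER",
    "CONTRIBUTORS", "DATE", "TYPE", "FORMAT", "IDENTIFIER",
    "SOURCE", "LANGUAGE", "RELATION", "COVERAGE", "RIGHTS",
}

# Hypothetical mapping from the spelled-out names used in the sample
# record to the labels above (an assumption of this sketch).
ALIASES = {
    "OTHER CONTRIBUTOR": "CONTRIBUTORS",
    "RESOURCE TYPE": "TYPE",
    "RESOURCE IDENTIFIER": "IDENTIFIER",
}

SAMPLE = [
    ("TITLE", "Introduction to cataloging and classification"),
    ("CREATOR", "Taylor, Arlene G."),
    ("OTHER CONTRIBUTOR", "Wynar, Bohdan S."),
    ("DATE", "1992"),
    ("FORMAT", "BOOK"),
    ("LANGUAGE", "ENG"),
    ("PAGES", "633"),
    ("PUBLISHER", "Libraries Unlimited"),
    ("SUBJECT", "Cataloging."),
    ("SUBJECT", "subject cataloging."),
    ("SUBJECT", "Classification -- Books"),
    ("DESCRIPTION", "Textbook on cataloging and classification"),
    ("RESOURCE TYPE", "text.monograph"),
    ("RESOURCE IDENTIFIER", "(ISBN) 0872879674"),
]


def check_record(record):
    """Split a record into pairs with a recognized label and pairs without one."""
    recognized, unrecognized = [], []
    for name, value in record:
        label = ALIASES.get(name, name)
        if label in DC_LABELS:
            recognized.append((label, value))
        else:
            unrecognized.append((name, value))
    return recognized, unrecognized


recognized, unrecognized = check_record(SAMPLE)
print(unrecognized)
```

With this sample the only pair reported as unrecognized is PAGES, which is not among the fifteen labels on the element slides; the sketch reports it rather than guessing which element it belongs under.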

RFC 1807

• BIB-VERSION:: CS-TR-v2.1
• ID:: UCB//123456
• ENTRY:: September 9, 1997
• TYPE:: BOOK
• TITLE:: Introduction to cataloging and classification
• AUTHOR:: Wynar, Bohdan S.
• AUTHOR:: Taylor, Arlene G.
• DATE:: 1992
• PAGES:: 633
• COPYRIGHT:: Libraries Unlimited, 1992
• SERIES:: Library Science Text Series
• END:: UCB//123456
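
The `FIELD:: value` layout of this record is simple enough to read with a few lines of code. The sketch below is a rough illustration rather than a full RFC 1807 implementation: the function name is hypothetical, and treating a line without `::` as a continuation of the previous field is an assumption of this example.

```python
RECORD = """\
BIB-VERSION:: CS-TR-v2.1
ID:: UCB//123456
ENTRY:: September 9, 1997
TYPE:: BOOK
TITLE:: Introduction to cataloging and classification
AUTHOR:: Wynar, Bohdan S.
AUTHOR:: Taylor, Arlene G.
DATE:: 1992
PAGES:: 633
COPYRIGHT:: Libraries Unlimited, 1992
SERIES:: Library Science Text Series
END:: UCB//123456
"""


def parse_bib_record(text):
    """Return the record as an ordered list of (field, value) pairs."""
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, sep, value = line.partition("::")
        if sep:
            fields.append((name.strip(), value.strip()))
        elif fields:
            # Assumption: a line with no "::" continues the previous field.
            prev_name, prev_value = fields[-1]
            fields[-1] = (prev_name, prev_value + " " + line)
    return fields


fields = parse_bib_record(RECORD)
authors = [value for name, value in fields if name == "AUTHOR"]
print(authors)
```

Keeping the fields as an ordered list rather than a dictionary preserves repeated fields such as AUTHOR, which a plain dictionary would overwrite.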

Minimal TEI Header

<teiHeader>
<fileDesc>
<titleStmt>
<title>Introduction to cataloging and classification</title>
<respStmt><name>Bohdan S. Wynar</name><resp>8th edition by</resp>
<name>Arlene G. Taylor</name>
</respStmt>
</titleStmt>
<publicationStmt>
<distributor>Libraries Unlimited</distributor>
</publicationStmt>
<sourceDesc>
<bibl>Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992.
</bibl>
</sourceDesc>
</fileDesc>
</teiHeader>
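
Because the TEI header is markup, its fields can be pulled out with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a well-formed copy of the header (every `<name>` closed and the final tag written as `</teiHeader>`). The variable names are illustrative only, and a real TEI document would normally carry more structure than this minimal header.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the minimal header from the slide.
TEI_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Introduction to cataloging and classification</title>
      <respStmt>
        <name>Bohdan S. Wynar</name>
        <resp>8th edition by</resp>
        <name>Arlene G. Taylor</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <distributor>Libraries Unlimited</distributor>
    </publicationStmt>
    <sourceDesc>
      <bibl>Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992.</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""

root = ET.fromstring(TEI_HEADER)

# Paths are relative to the <teiHeader> root element.
title = root.findtext("fileDesc/titleStmt/title")
names = [n.text for n in root.findall("fileDesc/titleStmt/respStmt/name")]
distributor = root.findtext("fileDesc/publicationStmt/distributor")

print(title)
print(names)
print(distributor)
```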

MARC Record (display)

ID:DCLC9124851-B RTYP:c ST:p FRN: MS:c EL: AD:06-20-91
CC:9110 BLT:am DCF:a CSC: MOD: SNR: ATC: UD:04-11-92
CP:cou L:eng INT: GPC: BIO: FIC:0 CON:b
PC:s PD:1992/ REP: CPI:0 FSI:0 ILC:a II:1
MMD: OR: POL: DM: RR: COL: EML: GEN: BSE:
010 9124851
020 0872878112 (cloth)
020 0872879674 (paper)
040 DLC$cDLC$dDLC
050 00 Z693$b.W94 1991
082 00 025.3$220
100 1 Wynar, Bohdan S.
245 10 Introduction to cataloging and classification /$cBohdan S. Wynar.
250 8th ed. /$bArlene G. Taylor.
260 Englewood, Colo. :$bLibraries Unlimited,$c1992.
300 xvii, 633 p. :$bill. ;$c24 cm.
440 0 Library science text series
504 Includes bibliographical references (p. 591-599) and index.
650 0 Cataloging.
650 0 Subject cataloging.
650 0 Classification$xBooks.
630 00 Anglo-American cataloguing rules.
700 10 Taylor, Arlene G.,$d1941-
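
In the display above, each variable field carries its data as subfields introduced by `$` plus a one-character code. The sketch below splits the data portion of a field into (code, value) pairs. It assumes, as this display style appears to, that text before the first `$` is subfield a; it does not attempt to separate the tag and indicators, whose spacing is not reliable in this transcript. The function name is hypothetical.

```python
import re


def split_subfields(data):
    """Split MARC display data such as 'Title /$cAuthor.' into (code, value) pairs."""
    # re.split with a capturing group yields:
    # [text before first "$", code1, value1, code2, value2, ...]
    parts = re.split(r"\$([a-z0-9])", data)

    subfields = []
    if parts[0]:
        # Assumption: leading text with no delimiter is subfield "a".
        subfields.append(("a", parts[0]))
    subfields.extend(zip(parts[1::2], parts[2::2]))
    return subfields


title_field = split_subfields(
    "Introduction to cataloging and classification /$cBohdan S. Wynar."
)
imprint_field = split_subfields("Englewood, Colo. :$bLibraries Unlimited,$c1992.")
print(title_field)
print(imprint_field)
```

Note that the ISBD punctuation (" /", " :", ",") stays attached to the end of each subfield value, which is how the 245 and 260 fields in the record carry both the MARC coding and the ISBD display form at once.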

Metadata Resources

• Check the Links section from the class home page

• Best site is the “Digital Library: Metadata Resources” page from IFLA at http://www.ifla.org/ifla/II/metadata.htm