Metadata for DL
Metadata Architecture for Digital Libraries:
Conceptual Framework for Indian Digital Libraries
Madhusudana Rao CR
C-DAC, Bangalore.
Agenda
• Introduction
• Metadata
• Digital Library Architecture
  – SODA
  – STARTS
• Indian Digital Library
  – Background
  – Proposed Architecture
  – SODA & STARTS
• Conclusion
Out of scope
• Search Engines - General
• Digital Library - General
Introduction
• Information Processing & Retrieval
  – Typical Library Environment
  – Library Automation
  – Networking of Libraries
  – Digital Library
  – Digital Library initiatives
Introduction
• Digital Library Scene
  – Search Engines
    • Heterogeneous
    • Vertical Information Retrieval
    • Unique User Interface
    • Search engines are different
    • Protocols are different
    • Querying & Ranking
    • Incompatible across the sources
Introduction
  – Possible solutions
    • Identifying the User Group
    • Identifying the Information Sources
    • Negotiating with different Information Sources
    • Resource Description Format
    • Choose best Information Source to evaluate Query
    • Evaluate the query at these sources
    • Merge the Query Results from these sources
New Protocol
• User
• User Query
• Information Source
• Networked Environment
• RDF Metadata
• User Interface
• Search & Retrieval
Issues..
• Metadata
• Network Protocols
• Possible Solutions for typical environment
Metadata…definition
Structured data about data...
Metadata…definition
• Metadata is data that helps to design, create, describe, preserve and use information systems and resources.
• Metadata can play a role in the development of effective, authoritative, interoperable, scalable, and preservable information and record-keeping systems.
Metadata…means
• Information Resource
• Library Catalogue
  – Index, Abstracts, Catalog Records, etc. > MARC, AACR, LCSH, etc.
• Human Generated Textual description
• Machine generated data
Metadata…features
• Content
  – Intrinsic
    • What does it contain?
    • What is it about?
• Context
  – Extrinsic
    • Who, What, Why, Where, How etc.
• Structure
  – Formal Set
Metadata…Attributes
• Intrinsic
  – Subject, Title, Author, Publisher, Publication place, Other agent, Date, Object type, Form, Identifier, Relation, Source, Language, Coverage, Abstract, Version, Notes, Signature, Classification, Keyword
Metadata…Attributes
• Extrinsic
  – System Requirement, Mode of access, Availability, Cost, Control, Extent, Encoding description, Revision description
Metadata…for two communities
• Information Generators
• Librarians / Cataloguers
Metadata… can be
• Information Objects
  – Physical
  – Intellectual Form
Metadata…similar
• Typical Physical Library:
  – Catalogue
  – Book Racks
  – Books
Metadata…currently
• Electronic Information Environment
  – Users search Metadata
  – Pointers
  – Primary Information available on computer display
• Distinction
  – Electronic Environment
Metadata…process
[Diagram labels: Two Communities (Generators of Information; Libraries & Cataloguers), Users, Metadata]
Metadata…can be
• Need not be Digital
• More than a description of an object
• Comes from a variety of sources
• Continues to accrue
• One object's metadata can also be another information object's metadata
Metadata…can be
• Intermediate steps to retrieve content
• Surrogates of objects
Metadata… need
• Internet & WWW have witnessed exponential growth
• The need of the hour on the Internet is catalogs of some kind
• The Internet/WWW was not designed to catalog its contents
Metadata…need
• Resource Description is a Challenge
• Tools are available
• But they are just directory listings of network resources and search engines
• Metadata is one of the solutions
• Again, standards are yet to make their impact
Metadata…issues
• Increased accessibility
  – Searching > existence of rich and consistent metadata
  – Search across multiple collections
  – Distributed across several repositories
Metadata…issues
• Retention of Text
  – Collection of objects
  – Complex interrelationships with people, places, movements & events
  – Documenting and maintaining those relationships
  – Authenticity, structural and procedural integrity
Metadata…issues
• Expanding use
  – Disseminating digital versions
  – Geography
  – Economics
  – Infinite ways to search information
  – Retrieve to wider community
Metadata…issues
• Multi-versioning
  – Variant versions
  – High resolution copy for preservation
  – Low resolution copy for thumbnail image for quick reference and network transfers
Metadata…issues
• Legal Issues
  – Track many layers of rights and reproduction information
  – Privacy
  – Proprietary interests
Metadata…issues
• Preservation
  – Generations - H/W & S/W
  – Technical, Descriptive and Preservation data
  – Information objects to remain accessible and intelligible over time
Metadata…issues
• System improvement and economics
  – Benchmarking
  – Planning new systems
Metadata..life cycle
[Cycle diagram: Creation & Multi-Versioning, Organization, Searching & Retrieval, Utilization, Preservation & Disposition]
Metadata…standards
• For Metadata to be useful & cost-effective it is essential that
  – Structure, Semantics and Syntax conform to standards
  – It captures the essence of sources
  – Distributed metadata model
Metadata…standards
• There is no single international standard for Metadata
• Different levels: from complex, rich formats to simple formats
• Several metadata schemes have been proposed for different levels of requirements
Metadata…standards
• IAFA templates
• WWW semantic header
• URC (Uniform Resource Citation)
• OCLC InterCat project
• TEI (Text Encoding Initiative)
• Search engine meta tags
• Resource Description Framework
• EAD (Encoded Archival Description)
• GILS (Government Information Locator Service)
• Federal Geographic Data Committee
• Museum Educational Site Licensing Project
• Dublin Core
Dublin Core
Because it is simple…….. Yet effective ….
Dublin Core..means
• Dublin, Ohio
• International consensus meetings, workshops, etc
• Emerging Infrastructure for Internet
• Support Resource Discovery
• Elements represent a broad interdisciplinary consensus
• Core set of elements
Dublin Core..standard
• Comprises 15 core elements
• Consensus by an International, Cross-disciplinary group representing
  – Library & Information
  – Computer Science
  – Text Encoding
  – Museum
  – Related fields of scholarship
Dublin Core..standard
• Each of the 15 elements is optional and repeatable
• Each element has a limited set of qualifiers and attributes
• Simple DC
• Qualified DC
Dublin Core..goals
• Simplicity of creation & Maintenance
  – Non-specialists can create descriptive records for effective retrieval in a networked environment
• Commonly understood semantics
  – Digital tourist for non specialist searcher
  – Convergence of common, more generic elements
  – Increasing visibility and accessibility
Dublin Core..goals
• International scope
  – 20 languages
  – Coordinating efforts
  – RDF - WWW
• Technical challenges of Internationalization
  – Multilingual & Multicultural nature of electronic information universe
Dublin Core..goals
• Extensibility
  – Additional resource discovery needs
Dublin Core..elements
• Content
  – Coverage, Description, Type, Relation, Source, Subject and Title
• Intellectual property
  – Contributor, Creator, Publisher & Rights
• Instantiation
  – Date, Format, Identifier & Language
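The grouping above can be written down as a small lookup table. The following is a minimal sketch, not an official Dublin Core API: the names `DC_GROUPS`, `DC_ELEMENTS` and `unknown_elements` are hypothetical, and the sample record values are invented for illustration. A record is modelled as a list of (element, value) pairs so that any element can be left out or repeated.

```python
# Grouping of the 15 Dublin Core elements as listed on the slide.
DC_GROUPS = {
    "content": ["coverage", "description", "type", "relation",
                "source", "subject", "title"],
    "intellectual_property": ["contributor", "creator", "publisher", "rights"],
    "instantiation": ["date", "format", "identifier", "language"],
}

# Flat set of all element names, used for validation.
DC_ELEMENTS = {name for names in DC_GROUPS.values() for name in names}


def unknown_elements(record):
    """Return the element names in `record` that are not Dublin Core elements.

    `record` is a list of (element, value) pairs, so every element is
    optional and may appear more than once.
    """
    return sorted({name for name, _ in record if name not in DC_ELEMENTS})


# Hypothetical record: the values are illustrative, not taken from the talk.
record = [
    ("title", "Sample palm-leaf manuscript"),
    ("language", "sa"),
    ("language", "kn"),  # the same element repeated
    ("format", "image/jpeg"),
]
```

Calling `unknown_elements(record)` on the sample returns an empty list, while a pair such as `("colour", "red")` would be reported as unknown.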
Dublin Core..implementation
• Dublin Core web site lists implementations: 15 in North America and Mexico, [number missing] in Europe, and 12 in Asia and Australia
Digital Library Architecture
• SODA (Smart Objects Dumb Archives)
• STARTS (Stanford Protocol proposal for Internet Retrieval and Search)
Digital Library
• Digital Library Services
  – User
    • Functionality & Interface
  – Searching
  – Browsing
• Archive
  – Managed sets of objects
Digital Library
• Digital Object
  – Stored and trafficked digital content
    • Simple files
    • Sophisticated objects
Digital Library
[Diagram labels: Publishers; Digital Objects in Archives; Archive 1, Archive 2, … Archive N; Digital Library Service Providers; Digital Library Services; Library Users; Digital Objects out of Archives]
Digital Library.. builds
• Identifying a user group
• Identifying archives holding information of interest
• Negotiating terms and conditions with publishers
• Creating Indices
• Services such as Search & Browse
Digital Library.. builds
• Creating User interaction services
  – Terms & Conditions
  – Authentication
  – Billing
  – Display
Digital Library.. hindered
• Interoperability
• Object mobility
• Complex archives
Digital Library..cons
• Digital Libraries are partitioned
  – Discipline - Computer Science, Aeronautics, Physics, etc.
  – Format - Technical reports, video, software, etc.
• Interdisciplinary search is difficult
• Resource Description includes manuscripts, software, data sets etc.
Digital Library..cons
• Manuscripts Vs Other objects - Reintegration
• All digital storage and transmission, tight integration
SODA…background
• Information is generated in several forms
• Differentiated by semantic types (report, software, video, data sets etc.)
• A given semantic type is further differentiated by syntactic representation (PS, PDF, Word)
• Media boundaries exist
SODA…addresses
• Archive-independent container construct
• All semantic and syntactic data types
• Objects that are logically grouped together
• Archived & manipulated as a single object
• Several objects can communicate with each other
• Arbitrary network services
SODA..addresses
• Traditional functionality associated with archives has been pushed down into objects
• Makes objects smarter (increases their responsibility)
• Makes archives dumber (decreases their responsibility)
SODA
• Archives exist to help the user locate objects
• Once an object is found, the user interacts directly with it
Smart Objects.. illustration
                 Smart Archives            Dumb Archives
Smart Objects    SOSA (Ex: none)           SODA (Ex: NCSTRL+)
Dumb Objects     DOSA (Ex: NCSTRL)         DODA (Ex: FTP server)
SODA Model…implementation
Buckets..containers
• Object oriented containers
• Logically grouped items are
  – Collected
  – Stored
  – Transported as a single unit
• Many forms of same data
• Related & non traditional data (Supportive material)
Buckets.. containers
• Multiple packages
• Packages can correspond to semantic types
  – Manuscript, software etc.
  – Metadata
  – Terms and conditions
  – Pointers
• Single package can have several items
Bucket..architecture
[Diagram: a bucket with a Handle (unique ID) and Access Methods. Packages inside the bucket, with the elements inside each package:]
• Terms and Conditions
• Metadata (RFC 1807, Dublin Core)
• Manuscript: .ps, .pdf, .tex, .doc
• Software: .tar, .c, .java, .asp
• Images: .gif, .jpg
• Data sets: .xls, .tar
Bucket…requirements
• Unique ID - handle
• Either standalone or multiple repositories
• Standalone - WWW through TCP/IP
• Moderation of number of buckets through intelligence and functionality
• Individual buckets may have custom terms and conditions
Buckets..characteristics
• Is of arbitrary size
• Globally unique ID
• 0 or more components called packages
• Package contains 1 or more components - elements
• Element can be a file or pointer
• Packages and elements can be other buckets
Buckets..characteristics
• A package can be a pointer to a remote bucket, another package or element
• Buckets can keep internal logs of actions
• Interactions or communication between buckets are made only through defined methods
• Buckets can initiate actions, they do not have to wait to be acted on
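The characteristics listed on these slides can be pictured as a small data structure. This is a minimal sketch under stated assumptions, not the SODA or NCSTRL+ implementation: the class names, method names, handle and file names are all hypothetical. It shows a bucket with a unique handle, packages holding elements (files or pointers), changes made only through defined methods, and an internal log of actions.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """A file or a pointer held inside a package."""
    name: str
    target: str  # file path, or pointer to a remote item


@dataclass
class Package:
    """A named group of items; an item may itself be a bucket."""
    name: str
    items: List[Union[Element, "Bucket"]] = field(default_factory=list)


@dataclass
class Bucket:
    """A container with a unique handle, packages, and an internal log."""
    handle: str
    packages: List[Package] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add_package(self, name: str) -> Package:
        """Create a package inside the bucket and record the action."""
        package = Package(name)
        self.packages.append(package)
        self.log.append(f"add_package {name}")
        return package

    def add_element(self, package_name: str, element: Element) -> None:
        """Add an element to an existing package and record the action."""
        for package in self.packages:
            if package.name == package_name:
                package.items.append(element)
                self.log.append(f"add_element {package_name}/{element.name}")
                return
        raise KeyError(f"no package named {package_name!r}")

    def list_elements(self) -> List[str]:
        """List plain elements as 'package/element' (nested buckets skipped)."""
        return [f"{p.name}/{item.name}" for p in self.packages
                for item in p.items if isinstance(item, Element)]


# Hypothetical handle and file names, for illustration only.
bucket = Bucket(handle="example/0001")
bucket.add_package("manuscript")
bucket.add_element("manuscript", Element("report.pdf", "files/report.pdf"))
```

Routing every change through `add_package` and `add_element` is what lets the bucket keep its own log, which is one way to read "interactions are made only through defined methods".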
Traditional Vs Bucket repository
[Diagram comparing two stacks, each accessed by a User. Traditional repository: Repository Interface, Intelligence, Archived Objects. Bucket repository: Repository Interface, Optional Intelligence, Archived Buckets, with a Bucket Extraction Procedure.]
Buckets..protocol
[Diagram labels: User; Archive; Index holdings; Search/retrieve holdings; Display holdings; bucket]
Bucket..Tools
• Author Tool
  – Metadata
  – Adds packages
  – Adds elements to package
  – Selects applicable clusters
  – Terms and conditions
Bucket..Tools
• Management Tool
  – Interface
  – Query and update buckets
• Bucket Matching System
  – SDI
  – Find similar works by different authors
  – Arbitrary SDI
  – Metadata scrubbing
Buckets..implementation
• NCSTRL
• NCSTRL+
STARTS
• Stanford Digital Library Project
• Search Engine Vendors
STARTS
• Document Sources
  – Internal networks
  – Internet
• Source Contents
  – Hidden behind search interfaces
• Algorithms/Protocols are different
STARTS..Architecture
• Large Number of resources
• Each resource consists of one or more sources
• A source is a collection of files
• Accepts queries from clients and produces results
• Sources may be small or large
• Extract the source list from resources periodically
STARTS..Architecture
• Extract Metadata and content summaries from sources periodically
• Send the query to a source through its resource
• Communicate with promising resources
• Results come from multiple sources; merge them and return them to the user
STARTS..Query language
• Filter expression
  – Boolean nature
  – Defines documents
• Ranking expression
  – Associates score with documents
STARTS..Query language
• L-strings
  – Language-country
  – String behavior
• Atomic Terms
  – Fields
  – Modifiers
• Complex filter expression
  – and, or, and-not, prox etc
STARTS..Query language
• Complex ranking expressions
• Global settings
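The split between a filter expression and a ranking expression can be illustrated with a toy evaluator. This is only a sketch: the tuple encoding, the field names and the occurrence-count score are assumptions made for illustration and are not the STARTS query syntax. The filter decides which documents qualify; the ranking terms only order the documents that passed the filter.

```python
def matches(expr, doc):
    """Evaluate a Boolean filter expression against one document.

    `expr` is a nested tuple: ("term", field, word), ("and", e1, e2, ...),
    ("or", e1, e2, ...) or ("and-not", e1, e2). This encoding is hypothetical.
    """
    op = expr[0]
    if op == "term":
        _, field_name, word = expr
        return word.lower() in doc.get(field_name, "").lower().split()
    if op == "and":
        return all(matches(e, doc) for e in expr[1:])
    if op == "or":
        return any(matches(e, doc) for e in expr[1:])
    if op == "and-not":
        return matches(expr[1], doc) and not matches(expr[2], doc)
    raise ValueError(f"unknown operator {op!r}")


def score(ranking_terms, doc):
    """Toy score: total occurrences of the ranking terms in their fields."""
    total = 0
    for field_name, word in ranking_terms:
        total += doc.get(field_name, "").lower().split().count(word.lower())
    return total


def search(filter_expr, ranking_terms, docs):
    """Keep documents that pass the filter, then order them by score."""
    hits = [d for d in docs if matches(filter_expr, d)]
    return sorted(hits, key=lambda d: score(ranking_terms, d), reverse=True)


# Invented documents, for illustration only.
docs = [
    {"id": "d1", "title": "veda commentary", "body": "veda veda text"},
    {"id": "d2", "title": "veda index", "body": "index of hymns"},
    {"id": "d3", "title": "physics report", "body": "veda"},
]
```

For example, `search(("term", "title", "veda"), [("body", "veda")], docs)` keeps `d1` and `d2` and places `d1` first, while `d3` is dropped by the filter even though its body mentions the ranking term.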
STARTS..Merging ranks
• Unnormalized score of the document for each query
• ID of the sources where document appears
• Statistics
  – Term-frequency, Term-weight, Document-frequency, Document-size, Document-count
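Because each source reports unnormalized scores, a metasearcher has to bring them onto a common scale before merging. The sketch below assumes one simple approach, linear rescaling by each source's advertised score range, which is an illustrative choice and not something the slides prescribe. The function name, source IDs and scores are hypothetical. When a document appears at several sources, the highest rescaled score is kept.

```python
def merge_results(results_by_source, score_ranges):
    """Merge per-source result lists into one ranked list.

    results_by_source: {source_id: [(doc_id, raw_score), ...]}
    score_ranges:      {source_id: (min_score, max_score)}
    Returns a list of (doc_id, normalized_score, source_id), best first.
    """
    best = {}
    for source_id, results in results_by_source.items():
        low, high = score_ranges[source_id]
        for doc_id, raw in results:
            # Rescale the raw score into [0, 1] using the source's range.
            normalized = (raw - low) / (high - low)
            if doc_id not in best or normalized > best[doc_id][0]:
                best[doc_id] = (normalized, source_id)
    merged = [(doc_id, s, src) for doc_id, (s, src) in best.items()]
    return sorted(merged, key=lambda row: row[1], reverse=True)


# Invented results from two sources that score on different scales.
results = {
    "A": [("d1", 8.0), ("d2", 2.0)],
    "B": [("d3", 0.9), ("d1", 0.5)],
}
ranges = {"A": (0.0, 10.0), "B": (0.0, 1.0)}
merged = merge_results(results, ranges)
```

With these inputs the merged order is `d3`, `d1`, `d2`; `d1` is credited to source `A`, where its rescaled score is higher. The per-term statistics listed on the slide would allow a more careful re-ranking than this range-based rescaling.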
STARTS..Source metadata
• Properties of the source
  – Fields supported, score range, linkage etc.
• Content Summary of the source
  – List of words that appear in the source
  – Statistics of each word listed
  – Total documents in the list etc.
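A content summary of this kind is what lets a metasearcher pick promising sources before sending any query. The following is a minimal sketch of one possible heuristic, assumed here for illustration only: each source is scored by the fraction of its documents that contain each query word. The dictionary layout, the function name `rank_sources` and the sample numbers are hypothetical.

```python
def rank_sources(query_words, summaries, top_k=2):
    """Pick the most promising sources for a query from content summaries.

    summaries: {source_id: {"document_count": int,
                            "document_frequency": {word: int}}}
    Returns up to `top_k` source IDs with a non-zero score, best first.
    """
    scored = []
    for source_id, summary in summaries.items():
        total = summary["document_count"]
        df = summary["document_frequency"]
        # Sum, over query words, the share of documents containing the word.
        s = sum(df.get(word, 0) / total for word in query_words)
        scored.append((s, source_id))
    # Highest score first; ties broken by source ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [source_id for s, source_id in scored[:top_k] if s > 0]


# Invented summaries for three sources.
summaries = {
    "heritage": {"document_count": 100,
                 "document_frequency": {"veda": 40, "manuscript": 25}},
    "physics": {"document_count": 200,
                "document_frequency": {"quantum": 90}},
    "art": {"document_count": 50,
            "document_frequency": {"manuscript": 5, "painting": 30}},
}
```

For the query words `["veda", "manuscript"]` this selects `heritage` and then `art`, and never contacts `physics`, which is the point of gathering metadata about collections instead of gathering all documents.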
STARTS..in the end
• General Search Engines
  – Gathers all documents on the network
• STARTS
  – Gathers metadata about collections
  – Selects small set of collections
  – Search & retrieve
STARTS..implementation
• Alexandria Digital Library
STARTS..limitation
• Text only
Indian Digital Library..
• Ancient & Diverse culture
• 5000-year-old culture
• Largest Democracy
• Seventh largest country
• High population
• Illiteracy
• Important part of World Economy
Indian Digital Library..
• World’s largest middle class
• Poverty
• Highly skilled manpower
• Generates Research Oriented Information
• Global interest
• Major players in IT in the World
• World is looking for ancient Indian Culture
Indian Scene..IT
• Content is lacking
• Control of Indian literature (both bibliographic and full text) is sketchy in almost all fields
• NII
• DL on Indian Heritage
• World Wide accord for Indian Heritage
• Internet Religion is the hot attraction
Indian Scene.. IT
• Western research has been done on the Vedas, Upanishads, Shastras, philosophy etc., but the soul is missing
• Protection, Preservation, Study, Research, Propagation for posterity
• NLP
• Knowledge Presentation
Indian Scene.. IT
• Speech recognition
• OCR
• Machine translation
• NL interfaces
• Text Processing through Index, Concordance, Thesauri, Dictionaries
Indian Scene.. IT
• National Integration, Guide Humanity, Conflicts, Aberrations, intolerance etc
• Value based system
• Historic priceless manuscripts
Indian Heritage
• Indian Art
• Indian Paintings
• Indian Sculpture
• Religion
Proposed Architecture….
• Background
  – User Group
    • Skilled & Illiterates
    • Oral tradition still exists
    • Multilingual
  – Information Sources
    • Content is lacking
    • Literature Control both Bibliographic and Text is very weak
Proposed Architecture….
    • Media
      – Computer Generated files to Palm leaf manuscripts
    • Language
    • Lack of standards for communication
    • Geographical boundaries
    • Accessibility
    • Reaching rural population
  – Publishing
    • Restricted to regional and local
Proposed Architecture….
    • National initiatives are yet to take off
    • Cooperative publishing is lacking
    • Unicode/Universal protocol is yet to make its impact
  – Network Resources
    • Communication infrastructure exists but is not stable
    • Individuals, Organizations, local, regional are generators of sources
    • Loose networks - manpower & infrastructure
    • Lack of communication standards
    • Duplicate works
Proposed Architecture….
  – Need of Networked Information Sources
    • Much priceless knowledge has been lost or is being lost
    • Future generations are missing the value of life told by ancestors
    • Protection, Preservation, Study, Research, Propagation for posterity
  – Looking to the future
    • NII
    • Better CCC: Computer, Communication, Content
Hybrid Architecture….
• Combination of SODA & STARTS Architecture
  – From SODA - Bucket Architecture
  – From STARTS - Search and Retrieval protocol
• Metadata - Dublin Core
  – For its simplicity and popularity
Bucket Architecture….
• Buckets are logically grouped
  – Language, Region, Content, Media, Images, etc. (any combination or together as intelligent)
• Large archives have buckets with many different functionalities
• Bucket may contain resources, packages, elements, metadata, pointers, etc.
Bucket Architecture….
• A bucket may be a unique entity, or many buckets may form an entity
• A bucket may be standalone with the content
• Many buckets may together form a resource
• Each bucket is built with some degree of intelligence and functionality
• Includes author tool and management tool
Bucket Architecture….
• Similarly user’s buckets are also created
• Bucket matching may take place
• Interactions with packages or elements are made only through defined methods on a bucket
• Bucket can initiate actions
• Buckets can exist inside or out of a repository
STARTS Architecture….
• Search, Retrieval and Browse within Bucket
• Resources, Sources, Elements, Packages, Pointers, etc. based on the Bucket definition
• Search query is made within the source defined in Bucket
• Query may be within a bucket or across buckets, based on the definition and functionality
STARTS Architecture….
• Ranking is done within the source
• Matching is done with User’s Bucket definition
• Results displayed based on Ranking and user’s requirements
• Although STARTS uses Z39.50 for metadata & transfer protocol, we propose to use Dublin Core for metadata
New Protocol..
• Need to create a standard for communication
• Information processing and retrieval
• Feeling of a universal information source
• Many sources converge as one resource
• Global information resource
• Universal accessibility by unified protocol
• Global access
New Protocol..
• The framework is just beginning