CONTI’2008, 5-6 June 2008, TIMISOARA

Towards a digital content management system

Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal
Technical University of Cluj-Napoca, Department of Computer Science
Content
- Introduction
- Ontological approach towards digital library (DL) design
- Requirements for DLs
- A DL model for scientific and technical purposes
- Information retrieval in DLs
- Conclusions
Digital Content Management Systems and Digital Libraries
- Historical perspective
  - Information gathering and preservation: an important attribute of any civilization
  - A measure of the civilization level
- Digital libraries
  - Not only a digitized form of classical libraries
  - A cooperation and communication environment
- Digital content management systems
  - Systems responsible for creation, storage and access to relevant information
  - They serve a community and/or a purpose (a project, a company, a virtual organization, etc.)
- The main goal of a DL (as outlined in the DELOS project)
  - "to allow any users transparent access to all the digital content anytime from anywhere in an efficient, effective and consistent way"
Ontology for technical and scientific purposes
- Ontology:
  - Concepts and relations
  - Intelligent reasoning and constraints
- Organizing a DL on an ontology basis:
  - For interoperability and flexible data exchange
  - For higher quality in information retrieval
- Concepts:
  - Digital library
    - a collection of digital content dedicated to a well-defined purpose, with which a number of users (actors) and specific functionalities are associated
    - dynamically created, modified and deleted in accordance with a given goal or purpose
    - serves a given community of users organized in virtual organizations
Concepts
- Digital object
  - Association of content (essence) and metadata (data about content)
  - The elementary data preservation entity
  - It may contain information in different formats (text, image, video, etc.)
- Collection
  - Association of digital objects based on a given criterion or purpose (e.g. project, conference, course)
  - It may also contain other collections
  - Note: a digital object may be part of a number of collections
- Virtual organization
  - A community of users associated with a digital library
  - Users that have a common goal and share common resources in order to fulfill the goal
  - Users have different roles and access rights (create, read, modify, delete digital objects)
- Metadata
  - Define different aspects of digital content:
    - Descriptive metadata (keywords, topics, ID)
    - Structural metadata (internal organization of the data)
    - Administrative metadata (access rights, quality control)
  - Used for efficient data search, indexing and retrieval
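The concepts on this slide map naturally onto a small data model. The following is only a minimal sketch under our own assumptions: the class names, field names and the `Right` enumeration are hypothetical and are not taken from the authors' implementation. It shows a digital object pairing content with the three kinds of metadata, nested collections, one object shared by several collections, and per-user access rights in a virtual organization.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Right(Enum):
    """Access rights named on the slide (hypothetical encoding)."""
    CREATE = auto()
    READ = auto()
    MODIFY = auto()
    DELETE = auto()


@dataclass
class Metadata:
    descriptive: dict = field(default_factory=dict)     # keywords, topics, ID
    structural: dict = field(default_factory=dict)      # internal organization
    administrative: dict = field(default_factory=dict)  # access rights, quality control


@dataclass
class DigitalObject:
    content: bytes        # the "essence"
    metadata: Metadata    # data about the content


@dataclass
class Collection:
    name: str
    objects: list = field(default_factory=list)
    subcollections: list = field(default_factory=list)

    def all_objects(self):
        """Yield the objects of this collection and of all nested collections."""
        yield from self.objects
        for sub in self.subcollections:
            yield from sub.all_objects()


@dataclass
class VirtualOrganization:
    name: str
    rights: dict = field(default_factory=dict)  # user name -> set of Right

    def allowed(self, user, right):
        """Return True if the user holds the given right."""
        return right in self.rights.get(user, set())


# One article that belongs to two collections at the same time.
article = DigitalObject(b"...", Metadata(descriptive={"keywords": ["grid", "ontology"]}))
project = Collection("project", [article])
proceedings = Collection("proceedings", [article])
library = Collection("library", subcollections=[project, proceedings])
team = VirtualOrganization("team", {"alice": {Right.CREATE, Right.READ}})
```

Collections hold references rather than copies, so the same object is reachable through both the project and the proceedings, which is the behaviour the note about shared objects describes.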
Concepts and relations for the technical and scientific domain
- Project:
  - A collection of digital objects:
    - Documents needed as support for the project (reference documents: books, articles, standards, etc.)
    - Documents dynamically created during the project (technical or scientific documents)
  - A set of users (team members) grouped in a virtual organization
  - A common goal
- Course:
  - A collection of teaching materials (electronic books, presentations, exercises and laboratory works)
  - Teaching staff (course responsible, assistants, PhD students, etc.) and students, with different access rights
  - Automated services for document upload and publication
- Events: conference, workshop, seminar
  - A collection of articles
  - A set of presentation and administrative materials (organizing committees, web portal, accommodation and travel information, etc.)
  - A set of participants
- A digital object may be part of a number of structured entities:
  - e.g. an article may be the result of a project, it may be included in the proceedings of a conference and it may be reference material for a course
Relations
Standards and communication protocols
http://mapageweb.umontreal.ca/turner/meta/english/metamap.html
Standards and communication protocols
- MARC (MAchine Readable Cataloging)
  - Promoted by the Library of Congress
  - Used to exchange bibliographic information between libraries
- Dublin Core metadata
  - Standard for simplified metadata exchange
- Z39.50
  - Defines a protocol for client-server based information retrieval
- The Open Archives Initiative (OAI)
  - A technical framework with client-driven interaction. The protocol supports interaction between a data provider and a service provider
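To make the data provider / service provider interaction concrete, here is a hedged sketch of the service-provider side of an OAI harvest. It assumes the OAI-PMH convention of HTTP requests carrying a `verb` argument and the `oai_dc` (unqualified Dublin Core) metadata prefix. The base URL `https://dl.example.org/oai`, the helper name `oai_request_url` and the sample record are illustrative assumptions, not part of the presented system, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of the Dublin Core element set.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def oai_request_url(base_url, verb, **arguments):
    """Build the URL a service provider would send to a data provider."""
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical data provider endpoint.
url = oai_request_url("https://dl.example.org/oai", "ListRecords",
                      metadataPrefix="oai_dc")

# Hand-written Dublin Core payload of the kind a data provider could return.
SAMPLE_RECORD = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Towards a digital content management system</dc:title>"
    "<dc:creator>Gheorghe Sebestyen-Pal</dc:creator>"
    "<dc:creator>Tünde Bálint</dc:creator>"
    "</oai_dc:dc>"
)

root = ET.fromstring(SAMPLE_RECORD)
title = root.findtext("dc:title", namespaces=NS)
creators = [e.text for e in root.findall("dc:creator", NS)]
```

The interaction is client-driven: the service provider decides when to issue the request, and the data provider only answers.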
Requirements for Digital Content Management systems
- Functional requirements:
  - Content submission (upload)
  - Content storage: distributed, replicated
  - Indexing and cataloging (based on metadata)
  - Content search and retrieval
    - Based on metadata
    - Based on full-text search
  - User management
  - Access control and authorization
  - Content annotation and classification
  - Data processing services
- Architectural requirements:
  - Distribution of resources, services and users
  - Transparent access to remote content (including other DL resources)
  - Management of QoS
A digital library model for scientific and technical purposes
[Figure: layered architecture diagram]
- Layers: Presentation Layer; Business Logic Layer; Storage and communication Layer
- Components shown: User Interfaces; OAI Data Provider (content harvesting); Metadata Management; Content Management; User & Virtual Organization Management; Search Engine; Security Management; Query Processor; History Recorder; Ontology Metadata (SQL); GRID infrastructure, SE & SRM; Repository
Information search and retrieval
- Content search and retrieval:
  - Based on metadata: DB techniques
  - Based on full-text analysis
- Full-text search:
  - Keyword search
  - Semantic information retrieval (e.g. documents with semantic annotations, semantic graphs, etc.)
  - Non-semantic information retrieval (e.g. probabilistic matching)
- Processing sequence:
  - Format conversion (DOC, PDF into TXT)
  - Document parsing: rule-based keyword extraction
  - Heuristics for relevance processing (probabilistic, distance, semantic graphs, etc.)
  - "Query by example"
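The processing sequence above can be sketched in a few lines, assuming the format conversion step has already produced plain text. Everything here is an illustrative assumption rather than the authors' method: the stop-word list, the minimum word length rule, the choice of Jaccard overlap as the distance heuristic, and the function names `extract_keywords` and `query_by_example`.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "is", "are", "with"}


def extract_keywords(text, top_n=5):
    """Rule-based extraction: lowercase, drop short words and stop words,
    keep the most frequent remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]


def jaccard(a, b):
    """Overlap between two keyword sets, in the range 0..1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def query_by_example(example_text, documents, top_n=5):
    """Rank document names by keyword overlap with an example text."""
    example_keywords = extract_keywords(example_text, top_n)
    scored = [
        (jaccard(example_keywords, extract_keywords(text, top_n)), name)
        for name, text in documents.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


documents = {
    "grid": "grid storage grid infrastructure replication",
    "bayes": "naive bayes classification of documents",
}
ranking = query_by_example("grid infrastructure for storage", documents)
```

In "query by example" the user supplies a document instead of a keyword list, and the same extraction step is applied to the query and to the stored documents.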
Non-semantic Information Retrieval
- Naive Bayes Algorithm
  - Allows classification of new (unlabeled) documents based on learning (labeled) document sets
  - The algorithm determines the probability of words being related to a given topic
  - Problems:
    - does not treat the problem of similar words
    - words are considered independent of their context ("naïve Bayes")
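As an illustration of the classification step, the following is a minimal multinomial naive Bayes sketch. It assumes documents arrive as lists of words and uses add-one (Laplace) smoothing, which the slides do not specify. The class name and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.doc_counts = Counter()              # topic -> number of training documents
        self.vocabulary = set()

    def train(self, words, topic):
        """Add one labeled document to the learning set."""
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, words):
        """Return the topic with the highest log-probability for the words."""
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, -math.inf
        for topic, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # topic prior
            total_words = sum(self.word_counts[topic].values())
            for w in words:
                # Each word contributes independently of its neighbours:
                # this is the "naive" assumption listed under Problems.
                count = self.word_counts[topic][w]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


classifier = NaiveBayes()
classifier.train(["grid", "storage", "node"], "grid")
classifier.train(["ontology", "concept", "relation"], "ontology")
```

Because "storage" and "repository" would be counted as unrelated words here, the sketch also exhibits the first problem listed above: similar words are not treated.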
- Topic-Based Vector Space Model Algorithm
  - Treats the problem of similar words (synonyms are replaced)
  - The stems of words are considered
  - The algorithm associates a vector with every relevant word
  - The similarity between two words is computed as the scalar product of the two associated vectors
  - A document vector is computed as a weighted sum of the vectors of the words it contains
  - We proposed an automatic weight computation based on the relevance of a word to a given topic: the weight of a word's vector is computed as a function of the word's appearance frequency in the processed documents
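The vector operations described for the Topic-Based Vector Space Model can be sketched as follows. This is not the authors' implementation: the two topic axes, the hand-assigned unit-length term vectors, the synonym table, and the use of relative frequency as the weight function are all assumptions made for illustration, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical unit-length term vectors over two topic axes: (storage, semantics).
TERM_VECTORS = {
    "grid": (1.0, 0.0),
    "repository": (0.8, 0.6),
    "ontology": (0.0, 1.0),
}
# Synonyms are replaced by a canonical term before any vector is looked up.
SYNONYMS = {"archive": "repository"}


def dot(u, v):
    """Scalar product, used for both word and document similarity."""
    return sum(a * b for a, b in zip(u, v))


def document_vector(words):
    """Weighted sum of term vectors; weight = relative frequency (assumed)."""
    canonical = [SYNONYMS.get(w, w) for w in words]
    counts = Counter(w for w in canonical if w in TERM_VECTORS)
    total = sum(counts.values())
    vector = [0.0, 0.0]
    for word, count in counts.items():
        weight = count / total
        for i, component in enumerate(TERM_VECTORS[word]):
            vector[i] += weight * component
    return vector


def similarity(doc_a, doc_b):
    """Cosine of the angle between two document vectors (0.0 if either is empty)."""
    u, v = document_vector(doc_a), document_vector(doc_b)
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


word_similarity = dot(TERM_VECTORS["grid"], TERM_VECTORS["repository"])
```

Because term vectors are not required to be orthogonal, related words such as "grid" and "repository" contribute to document similarity even when the documents share no word literally.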
Conclusions
- The paper presents a new vision on the design and implementation of digital content management systems.
- The proposed ontology-based DL allows better content organization and retrieval.
- The model was implemented on a GRID infrastructure.
- For search and information retrieval, two algorithms were implemented and tested:
  - The naïve Bayes algorithm is faster, but it is not context-aware
  - The Topic-Based Vector Space Model algorithm requires more processing time and more interaction from the user, but the quality of the results is higher