CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe...


Page 1: CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.


Towards a digital content management system

Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal

Technical University of Cluj-Napoca, Department of Computer Science

Page 2:

Content

- Introduction
- Ontological approach towards digital library (DL) design
- Requirements for DLs
- A DL model for scientific and technical purposes
- Information retrieval in DLs
- Conclusions

Page 3:

Digital Content Management Systems and Digital Libraries

- Historical perspective
  - Information gathering and preservation: an important attribute of any civilization
  - A measure of the civilization level
- Digital libraries
  - Not only digitized form of classical libraries
  - A cooperation and communication environment
- Digital Content management systems:
  - Systems responsible for: Creation, Storage and Access to relevant information
  - It serves a community and/or a purpose (a project, a company, a virtual organization, etc.)
- The main goal of a DL (as outlined in the DELOS project)
  - “to allow any users transparent access to all the digital content anytime from anywhere in an efficient, effective and consistent way”

Page 4:

Ontology for technical and scientific purposes

- Ontology:
  - Concepts and relations
  - Intelligent reasoning and constraints
- Organizing a DL on an ontology basis:
  - For interoperability and flexible data exchange
  - For higher quality in information retrieval
- Concepts:
  - Digital library
    - a collection of digital content dedicated to a well-defined purpose, with which a number of users (actors) and specific functionalities are associated
    - dynamically created, modified and deleted in accordance with a given goal or purpose
    - it serves a given community of users organized in virtual organizations

Page 5:

Concepts

- Digital object
  - Association of content (essence) and metadata (data about content)
  - The elementary data preservation entity
  - It may contain information in different formats (text, image, video, etc.)
- Collection
  - Association of digital objects based on a given criterion or purpose (e.g. project, conference, course)
  - It may also contain other collections
  - Note: a digital object may be part of a number of collections
- Virtual organization
  - A community of users associated with a digital library
  - Users that have a common goal and share common resources in order to fulfill the goal
  - Users have different roles and access rights (create, read, modify, delete digital objects)
- Metadata
  - Define different aspects of digital content:
    - Descriptive metadata (keywords, topics, ID)
    - Structural metadata (internal organization of the data)
    - Administrative metadata (access rights, quality control)
  - Used for efficient data search, indexing and retrieval
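The concepts on this slide can be read as a small data model. The Python sketch below is one possible reading, not the authors' implementation. All class, field and method names (`Metadata`, `DigitalObject`, `Collection`, `VirtualOrganization`, `all_objects`, `can`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    descriptive: dict = field(default_factory=dict)     # keywords, topics, ID
    structural: dict = field(default_factory=dict)      # internal organization of the data
    administrative: dict = field(default_factory=dict)  # access rights, quality control


@dataclass
class DigitalObject:
    content: bytes       # the "essence"
    metadata: Metadata   # data about the content


@dataclass
class Collection:
    name: str
    objects: list = field(default_factory=list)         # DigitalObject items
    subcollections: list = field(default_factory=list)  # nested Collection items

    def all_objects(self):
        """Yield the objects of this collection and of every nested collection."""
        yield from self.objects
        for sub in self.subcollections:
            yield from sub.all_objects()


@dataclass
class VirtualOrganization:
    name: str
    # user name -> set of rights drawn from {"create", "read", "modify", "delete"}
    rights: dict = field(default_factory=dict)

    def can(self, user, action):
        """True if the user holds the given right in this organization."""
        return action in self.rights.get(user, set())


paper = DigitalObject(b"...", Metadata(descriptive={"keywords": ["digital library"]}))
project = Collection("project", objects=[paper])
course = Collection("course", objects=[paper])   # same object, second collection
library = Collection("library", subcollections=[project, course])
team = VirtualOrganization("team", rights={"ana": {"read", "create"}})
```

Here the same `paper` object belongs to two collections, and `library` contains both of them, which mirrors the note that a digital object may be part of a number of collections.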

Page 6:

Concepts and relations for the technical and scientific domain

- Project:
  - A collection of digital objects:
    - Documents needed as support for the project (reference documents: books, articles, standards, etc.)
    - Documents dynamically created during the project (technical or scientific documents)
  - A set of users (team members) grouped in a virtual organization
  - A common goal
- Course:
  - A collection of teaching materials (electronic books, presentations, exercises and laboratory works)
  - Teaching staff (course responsible, assistants, PhD students, etc.) and students, with different access rights
  - Automated services for documents’ upload and publication
- Events: conference, workshop, seminar
  - A collection of articles
  - A set of presentation and administrative materials (organizing committees, web-portal, accommodation and travel information, etc.)
  - A set of participants
- A digital object may be part of a number of structured entities:
  - e.g. an article may be the result of a project, it may be included into the proceedings of a conference and it may be reference material for a course

Page 7:

Relations

Page 8:

Standards and communication protocols

http://mapageweb.umontreal.ca/turner/meta/english/metamap.html

Page 9:

Standards and communication protocols

- MARC (MAchine Readable Cataloging)
  - Promoted by the Library of Congress
  - Used to exchange bibliographic information between libraries
- Dublin Core metadata
  - Standard for simplified metadata exchange
- Z39.50
  - Defines a protocol for client-server based information retrieval
- The Open Archives Initiative (OAI)
  - A technical framework with client-driven interaction. The protocol supports interaction between a data provider and a service provider
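To make two of these standards concrete, the sketch below builds a flat Dublin Core description and composes an OAI-PMH `ListRecords` request of the kind a service provider sends to a data provider. It uses only the Python standard library. The wrapper element name `record`, the helper names and the endpoint `http://dl.example.org/oai` are assumptions made for illustration; the Dublin Core namespace URI, the `ListRecords` verb and the `oai_dc` metadata prefix come from the respective specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a flat XML record with one dc:* element per (name, value) pair."""
    root = ET.Element("record")              # wrapper element name is hypothetical
    for name, value in fields:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Compose an OAI-PMH ListRecords request URL for a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


record = dublin_core_record([
    ("title", "Towards a digital content management system"),
    ("creator", "Gheorghe Sebestyen-Pal"),
    ("date", "2008"),
])
xml_text = ET.tostring(record, encoding="unicode")
url = list_records_url("http://dl.example.org/oai")   # hypothetical endpoint
```

The request is client-driven: the service provider (harvester) decides when to issue it, and the data provider only answers.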

Page 10:

Requirements for Digital Content Management systems

- Functional requirements:
  - Content submission (upload)
  - Content storage: distributed, replicated
  - Indexing and cataloging (based on metadata)
  - Content search and retrieval
    - Based on metadata
    - Based on full-text search
  - Users management
  - Access control and authorization
  - Content annotation and classification
  - Data processing services
- Architectural requirements:
  - Distribution of resources, services and users
  - Transparent access to remote content (including other DL resources)
  - Management of QoS
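As an illustration of the "indexing and cataloging" and "search based on metadata" requirements, the following minimal sketch builds an inverted index from descriptive keywords to object IDs. It is an assumption about how such an index could look, not a description of the presented system; the function names and the sample catalog are hypothetical.

```python
from collections import defaultdict


def build_keyword_index(catalog):
    """Map each descriptive keyword to the set of object IDs that carry it."""
    index = defaultdict(set)
    for object_id, keywords in catalog.items():
        for keyword in keywords:
            index[keyword.lower()].add(object_id)
    return index


def search(index, *keywords):
    """Return the IDs of the objects tagged with every requested keyword."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()


# Hypothetical catalog: object ID -> descriptive keywords
catalog = {
    "obj-1": ["grid", "storage"],
    "obj-2": ["ontology", "metadata"],
    "obj-3": ["grid", "metadata"],
}
index = build_keyword_index(catalog)
```

A call such as `search(index, "grid", "metadata")` intersects the two posting sets, so only objects carrying both keywords are returned.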

Page 11:

A digital library model for scientific and technical purposes

[Figure: layered architecture of the proposed DL model. Layer labels: Presentation Layer; Business Logic Layer; Storage and communication Layer. Component labels: User Interfaces; OAI Data Provider (content harvesting); Metadata Management; Content Management; User & Virtual Organization Management; Search Engine; Security Management; Query Processor; History Recorder; Ontology Metadata (SQL); GRID infrastructure; SE & SRM; Repository.]

Page 12:

Information search and retrieval

- Content search and retrieval:
  - Based on metadata: DB techniques
  - Based on full-text analysis
- Full-text search:
  - Keyword search
  - Semantic Information Retrieval (e.g. documents with semantic annotations, semantic graphs, etc.)
  - Non-semantic Information Retrieval (e.g. probabilistic matching)
- Processing sequence:
  - Format conversion (DOC, PDF into TXT)
  - Document parsing: rule-based keyword extraction
  - Heuristics for relevance processing (probabilistic, distance, semantic graphs, etc.)
- “Query by example”
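The processing sequence can be sketched end to end as follows. The sketch assumes that format conversion has already produced plain text, uses a deliberately simple rule set for keyword extraction (lowercasing, a small illustrative stop-word list, a minimum token length), and uses keyword overlap as a stand-in for the relevance heuristics. Treating an example text as the query is only one possible reading of "Query by example". All names and sample documents are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}  # illustrative


def extract_keywords(text, min_length=3):
    """Rule-based extraction: lowercase, keep letter runs, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_length and t not in STOP_WORDS)


def relevance(query_keywords, doc_keywords):
    """Heuristic score: share of query keyword occurrences that also appear in the document."""
    total = sum(query_keywords.values())
    if total == 0:
        return 0.0
    matched = sum(n for word, n in query_keywords.items() if word in doc_keywords)
    return matched / total


def query_by_example(example_text, documents):
    """Rank document names by relevance to an example text, best first."""
    query = extract_keywords(example_text)
    scores = {name: relevance(query, extract_keywords(text))
              for name, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


documents = {
    "grid.txt": "Storage of digital content on a GRID infrastructure",
    "onto.txt": "An ontology of concepts and relations for a digital library",
}
ranking = query_by_example("ontology for the digital library", documents)
```

In this toy run `onto.txt` matches all three query keywords and `grid.txt` only one, so `onto.txt` is ranked first.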

Page 13:

Non-semantic Information Retrieval

- Naive Bayes Algorithm
  - Allows classification of new (unlabeled) documents based on a learning set of labeled documents
  - The algorithm determines the probability of words being related to a given topic
  - Problems:
    - does not treat the problem of similar words
    - words are considered independent of their context (“naïve Bayes”)
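A minimal multinomial naive Bayes classifier illustrates both the idea and the independence assumption listed under "Problems". This is a textbook formulation with add-one (Laplace) smoothing, which the slides do not mention; it is not the implementation tested by the authors. The function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (topic, list_of_words). Returns the counts used for prediction."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for topic, words in labeled_docs:
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        vocabulary.update(words)
    return topic_counts, word_counts, vocabulary


def classify(words, model):
    """Return the topic with the highest posterior log-probability for the given words."""
    topic_counts, word_counts, vocabulary = model
    n_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in topic_counts:
        total = sum(word_counts[topic].values())
        score = math.log(topic_counts[topic] / n_docs)   # log prior of the topic
        for w in words:
            # Add-one smoothing; each word contributes independently of its context
            score += math.log((word_counts[topic][w] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


model = train([
    ("grid", ["grid", "storage", "node"]),
    ("grid", ["grid", "job", "scheduler"]),
    ("ontology", ["concept", "relation", "ontology"]),
])
```

Because every word is scored on its own, two synonyms are treated as unrelated tokens, which is exactly the weakness the slide points out.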

- Topic-Based Vector Space Model Algorithm
  - Treats the problem of similar words (synonyms are replaced)
  - The stems of words are considered
  - The algorithm associates a vector with every relevant word
  - The similarity between two words is computed as the scalar product of the two associated vectors
  - A document vector is computed as a weighted sum of the vectors of the words it contains
  - We proposed an automatic weight computation based on the relevance of a word to a given topic:
    - According to the proposed method, the weight of a word vector is computed as a function of the word's appearance frequency in the processed documents
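The vector operations named above can be sketched as follows. The word vectors and the two topic axes are invented for illustration, synonym replacement and stemming are assumed to have been applied beforehand, and the weight is taken to be the raw term frequency because the slides do not give the exact weighting function. Comparing documents by the cosine of their vectors is likewise an assumption.

```python
import math
from collections import Counter

# Hypothetical word vectors over two topic axes (grid, ontology); values are illustrative.
WORD_VECTORS = {
    "grid":     (1.0, 0.0),
    "cluster":  (0.9, 0.1),
    "ontology": (0.0, 1.0),
    "concept":  (0.2, 0.8),
}


def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def word_similarity(w1, w2):
    """Similarity of two words: scalar product of their vectors."""
    return dot(WORD_VECTORS[w1], WORD_VECTORS[w2])


def document_vector(words):
    """Weighted sum of word vectors; the weight is the raw term frequency (an assumption)."""
    vec = [0.0, 0.0]
    for word, freq in Counter(words).items():
        if word in WORD_VECTORS:                 # words without a vector are ignored
            for i, x in enumerate(WORD_VECTORS[word]):
                vec[i] += freq * x
    return vec


def document_similarity(doc_a, doc_b):
    """Cosine of the angle between the two document vectors."""
    a, b = document_vector(doc_a), document_vector(doc_b)
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0
```

Unlike the naive Bayes sketch, related words such as `grid` and `cluster` have a non-zero scalar product here, so documents using different but related vocabulary can still be judged similar.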

Page 14:

Conclusions

- The paper presents a new vision on the design and implementation of digital content management systems.
- The proposed ontology-based DL allows better content organization and retrieval
- The model was implemented on a GRID infrastructure
- For search and information retrieval, two algorithms were implemented and tested.
  - The naïve Bayes algorithm is faster but it is not context aware
  - The Topic-Based Vector Space Model Algorithm requires more processing time and more interaction from the user, but the quality of the results is higher