Beyond Storage: Rethinking the role of repositories in scholarly communication
DELOS Workshop
Digital Repositories: Interoperability and Common Services
May 11, 2005
Sandy Payette
Cornell University
First… is there a problem?
Existing scholarly communication system
• Does not mirror the reality of the scholarly process
• Published information artifacts do not resemble the rich information that is produced along the process
• Not evolved enough to enable easy and effective integration and dissemination of new, rich forms of digital information
Figure: Disconnected networks: formal publication network; social network (actors)
Figure: Hybrid network: documents (formal and informal), data, services, actors
Legend: Data, Actor, Formal document, Informal document, Data sets, Web service
The Future: Rich Scholarly Information Networks
Roles of digital repositories today
• Early Dissemination:
– Enhance upstream scholarly communication
– Improvement over traditional pre-print (paper) sharing among scholars
• Open Access:
– Harnad’s “subversive proposal”
– Possibility of bypassing or eliminating traditional publisher model
• Document Discovery:
– Searching for documents in a repository
– Federation or metadata harvest for search over multiple repositories
• Storage and Archiving:
– E-print archives: author self-archiving gives scholars control over their intellectual output
– Institutional repositories: institutions commit to preservation
Evolutionary, but not revolutionary
• In many ways repositories represent an evolution of the traditional publishing paradigm
– Submit documents
– Gain access to documents…
– Share results earlier in the scholarly process, and electronically
• Still locked into document-centric paradigm
– Store documents to promote access
– Store documents to promote archiving
– Index documents to promote search and discovery
– Citation analysis to understand relationships of documents
Signs of Change – Scholars exercising the network
• Grid computing in sciences
– Share computing resources
– Share services and distributed virtual file systems
– Examples
  • Enabling Grids for E-Science (http://public.eu-egee.org/)
  • National Virtual Observatory (http://www.us-vo.org/)
• Humanities computing
– Hyperlinked historical documentary editions
– New Forms of Digital Scholarship
  • Rossetti archive (http://www.rossettiarchive.org/)
  • Perseus (www.perseus.tufts)
  • Pompeii Forum (http://pompeii.virginia.edu)
– Tibetan and Himalayan Digital Library (thdl.org)
Vision for a more revolutionary approach
The revolutionary opportunity…
• Looming on the horizon is the potential of a future scholarly communication system that is
– Highly collaborative
– Network-based
– Data-intensive
– Process-oriented
• We can change the way research and education are conducted by exposing rich knowledge-oriented information assets
• Digital repositories must be rationalized within this broader vision.
New Functionality
• Content aggregation:
– combining information entities in novel ways
• Knowledge integration:
– capturing semantic and factual relationships among information entities
• Information reuse:
– allowing secondary, tertiary products
• Information transformation:
– combining information entities with computational services
• Collaboration and contribution:
– blurring the line between authors, publishers, users, experts…
A New Scholarly Information System
1. Redefine the “information unit” of scholarly communication
2. Create a scholarly communication system that better supports the process of research and learning
3. Record the “crumb trails” of the scholarly process
3 Basic Requirements
(1) The new “information unit”
• Documents
• Text
• Data
• Simulations
• Images
• Video
• Computations
• Automated Analyses
Figure labels: Data, Aggregations
(2) Process-oriented Scholarly Communication System
• Decompose the traditional process (Roosendaal & Geurts)
– Registration (establish intellectual priority of result)
– Certification (certify quality and validity of result)
– Awareness (ensure accessibility)
– Archiving (ensure availability for future use)
– Rewarding (means to support tenure, promotion, compensation)
• But, they missed some things…
(2) Process-oriented Scholarly Communication System
• Add new services to the mix
– Workflow
– Collaborative functions (e.g., annotation, re-use)
– Data mining and analysis
– Preservation monitoring and migration
• Expose all as network-accessible atomic services
– Service discovery
– Service invocation
– Service aggregation, orchestration, choreography
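The idea of atomic services that can be discovered, invoked, and orchestrated can be sketched in a few lines. This is only an illustration under stated assumptions: the `ServiceRegistry` class, the tag-based discovery, and the `validate` and `index` services are hypothetical names invented here, and everything runs in memory rather than over a network.

```python
from typing import Callable, Dict, List, Set

# Assumption: a service is any function that takes and returns a dict payload.
Service = Callable[[dict], dict]


class ServiceRegistry:
    """Hypothetical in-memory stand-in for a network service registry."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}
        self._tags: Dict[str, Set[str]] = {}

    def register(self, name: str, service: Service, tags: List[str]) -> None:
        self._services[name] = service
        self._tags[name] = set(tags)

    def discover(self, tag: str) -> List[str]:
        """Service discovery: names of services advertising the given tag."""
        return sorted(n for n, tags in self._tags.items() if tag in tags)

    def invoke(self, name: str, payload: dict) -> dict:
        """Service invocation: call one atomic service."""
        return self._services[name](payload)

    def orchestrate(self, names: List[str], payload: dict) -> dict:
        """Service orchestration: feed each service's output to the next."""
        for name in names:
            payload = self.invoke(name, payload)
        return payload


def validate(obj: dict) -> dict:
    # Illustrative check only: a non-empty bytestream counts as valid.
    return {**obj, "valid": len(obj.get("bytes", b"")) > 0}


def index(obj: dict) -> dict:
    # Only objects that passed validation are marked as indexed.
    return {**obj, "indexed": obj.get("valid", False)}


registry = ServiceRegistry()
registry.register("validate", validate, ["ingest"])
registry.register("index", index, ["ingest", "discovery"])

result = registry.orchestrate(["validate", "index"], {"id": "obj:1", "bytes": b"abc"})
```

Keeping each service a plain function over a shared payload is what makes aggregation cheap here; a real system would replace the function calls with protocol requests.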
Process-orientation - workflows
Figure: two example workflows composed from a “World of Services” (other labels: SIP, Digital Object)
Ingest-oriented process: Validate bytestreams; Ingest to Repo; Link to Simulation Service; Assign Access Policy; Index and Register
Preservation-oriented process: Visit the Doctor; Format Migration; Object Versioning in Repo; Make Copies; Ingest to Archive
(3) Record the “crumb trails”
• Events
– Critical state transitions of information assets
– Preservation-noteworthy events
• Provenance
– When we enable re-use and re-combination of assets, we must be able to show where they came from
• Relationships
– Among information assets
– Versions of an asset
– Between agents and assets
– Between services and assets
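One way to picture a “crumb trail” is as an append-only list of events from which provenance can be reconstructed. The sketch below is an assumption-laden illustration: the `Event` fields, the `derived_from` link, and all identifiers are hypothetical and do not come from any particular repository system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    asset: str                          # the information asset the event concerns
    kind: str                           # e.g. "ingested", "transformed"
    agent: str                          # person or service responsible
    derived_from: Optional[str] = None  # source asset, if this one was derived


@dataclass
class CrumbTrail:
    events: List[Event] = field(default_factory=list)

    def record(self, event: Event) -> None:
        self.events.append(event)

    def provenance(self, asset: str) -> List[str]:
        """Follow derived_from links from an asset back to its original source."""
        parents: Dict[str, str] = {
            e.asset: e.derived_from for e in self.events if e.derived_from
        }
        chain = [asset]
        # Stop at an asset with no recorded parent, or if a cycle is detected.
        while chain[-1] in parents and parents[chain[-1]] not in chain:
            chain.append(parents[chain[-1]])
        return chain


trail = CrumbTrail()
trail.record(Event("dataset:1", "ingested", "agent:alice"))
trail.record(Event("dataset:1-clean", "transformed", "service:cleaner",
                   derived_from="dataset:1"))
trail.record(Event("figure:7", "derived", "agent:bob",
                   derived_from="dataset:1-clean"))

lineage = trail.provenance("figure:7")
```

The same event list also carries the agent-to-asset and service-to-asset relationships, since every event names who or what acted.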
How are current repository technologies poised?
Selected repositories with notable features re: the vision
• Open-source repository software
– Fedora
– DSpace
• Installed Systems
– aDORe (Los Alamos National Laboratory)
– arXiv
• Grid projects
– Storage Resource Broker (SRB)
– Chimera
Fedora vs. the vision
• Flexible digital object model
• Services associated with digital objects
• Relationships among digital objects
– Relationship ontology
– RDF-based metadata
– Search the repository “as a graph”
• Upcoming – new security architecture
– Policy enforcement (XACML)
  • Repository policy
  • Object policies (fine-grained control)
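The interplay of a repository-wide policy with fine-grained object policies can be sketched as follows. This is not Fedora's implementation: real XACML policies are XML documents evaluated by a policy decision point. The sketch only mimics the logic of XACML's deny-overrides combining algorithm, and the rule functions, request fields, and object identifiers are hypothetical.

```python
from typing import Callable, List, Optional

# Assumption: a rule returns "Permit", "Deny", or None when it does not apply.
Rule = Callable[[dict], Optional[str]]


def deny_overrides(rules: List[Rule], request: dict) -> str:
    """Combine rule decisions so that any Deny wins over any Permit."""
    decisions = [rule(request) for rule in rules]
    if "Deny" in decisions:
        return "Deny"
    if "Permit" in decisions:
        return "Permit"
    return "NotApplicable"


def repository_policy(request: dict) -> Optional[str]:
    # Repository-wide rule (illustrative): anyone may read.
    return "Permit" if request["action"] == "read" else None


def object_policy(request: dict) -> Optional[str]:
    # Object-level rule (illustrative): one object is restricted to curators.
    if request["object"] == "demo:restricted" and request["role"] != "curator":
        return "Deny"
    return None


policies = [repository_policy, object_policy]
guest_read_restricted = deny_overrides(
    policies, {"action": "read", "object": "demo:restricted", "role": "guest"}
)
guest_read_open = deny_overrides(
    policies, {"action": "read", "object": "demo:open", "role": "guest"}
)
```

Under deny-overrides, an object policy can tighten what the repository policy allows but cannot be loosened by it, which is one reasonable reading of “fine-grained control”.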
Fedora Repository – Web Services
Figure: Fedora repository architecture
Fedora Repository Modules: Manage, AuthN, AuthZ, Access, Validation, Resource Index, Storage, Dissemination, Registry
Web Services Exposure (REST and SOAP): Manage, Access, Registry, Search, RDF Index, OAI Provider
Clients: Client App, Batch Program, Other Service, Web Browser
Fedora Objects – RDF Graph view
Figure: RDF graph of one collection object and two member objects
Collection object: info:fedora/collection:1 (hasMember info:fedora/image:11, info:fedora/image:12; hasRep info:fedora/collection:1/bdef:1/MEMBERS)
Member object: info:fedora/image:11 (hasRep info:fedora/image:11/BLDG, info:fedora/image:11/bdef:2/getRelatedLetter)
Member object: info:fedora/image:12 (hasRep info:fedora/image:12/BLDG, info:fedora/image:12/bdef:2/getHIGH)
Literal values shown on the graph: lastModDate "2005-01-10:11:02", "2005-02-01:12:05", "2005-01-01:10:00"; dc:creator "Elly Cramer", "Chris Wilper", "Eddie Shin"
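Searching a repository “as a graph” amounts to matching patterns over subject, predicate, object triples. The sketch below uses the hasMember and hasRep arcs from the graph view with the short property names as they appear on the slide; the `match` and `member_representations` helpers are hypothetical and stand in for a real RDF query engine.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Arcs taken from the slide's graph; predicates use the slide's short names.
TRIPLES: List[Triple] = [
    ("info:fedora/collection:1", "hasMember", "info:fedora/image:11"),
    ("info:fedora/collection:1", "hasMember", "info:fedora/image:12"),
    ("info:fedora/collection:1", "hasRep", "info:fedora/collection:1/bdef:1/MEMBERS"),
    ("info:fedora/image:11", "hasRep", "info:fedora/image:11/BLDG"),
    ("info:fedora/image:11", "hasRep", "info:fedora/image:11/bdef:2/getRelatedLetter"),
    ("info:fedora/image:12", "hasRep", "info:fedora/image:12/BLDG"),
    ("info:fedora/image:12", "hasRep", "info:fedora/image:12/bdef:2/getHIGH"),
]


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def member_representations(collection: str) -> List[str]:
    """Two-hop query: representations of every member of a collection."""
    reps: List[str] = []
    for _, _, member in match(s=collection, p="hasMember"):
        reps.extend(rep for _, _, rep in match(s=member, p="hasRep"))
    return reps


reps = member_representations("info:fedora/collection:1")
```

The two-hop query is the kind of question that a document-centric index cannot answer directly but a graph can.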
DSpace vs. the vision
• The related Simile project is most interesting
– Significance: semantic web technologies brought to the task of search and discovery across different repository systems
– RDF-based search across heterogeneous metadata formats
– Ontology-based
• DSpace History system
– Event recording
– RDF-based
• Opportunity in DSpace 2
– Web service exposure?
– Service-based dissemination architecture?
LANL’s aDORe vs. the vision
• Standards-based repository architecture
– OAI-PMH
– MPEG-21 DIDL
– OpenURL
• Very good example of the use of simple protocols to enable modular service-based architecture
• Services dynamically associated with objects
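Part of what makes a protocol like OAI-PMH “simple” is that a request is just an HTTP URL with a verb and a few parameters. The sketch below builds two such URLs using standard OAI-PMH verbs and argument names (`ListRecords`, `GetRecord`, `metadataPrefix`, `from`, `set`, `identifier`). The base URL and the identifier are hypothetical, and no request is actually sent.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint; substitute a real repository's OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build a ListRecords request for harvesting metadata records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def get_record_url(base_url: str, identifier: str,
                   metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord request for a single item."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"


harvest_url = list_records_url(BASE_URL, from_date="2005-01-01")
record_url = get_record_url(BASE_URL, "oai:example.org:item-1")
```

Because every module speaks the same small request vocabulary, modules can be swapped or chained without sharing code.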
aDORe architecture
Figure: aDORe architecture at LANL, with numbered steps 1 to 7. Recoverable labels: publisher feeds (FTXT, A&I, TechReport); Pre-Ingest; Ingest; Repo Index; Identifier Resolver (CNRI handle, JAVA, C); MPEG-21 DIP Engine; Registry of transformations; Profile/Behavior Registry; DID; DID with DIM; In data.lanl.gov; APPLICATION. Components are connected through OAI-PMH and OpenURL interfaces.
Slide courtesy of Herbert Van de Sompel
arXiv vs. the vision
• Progress in decomposition and distribution of traditional steps in scholarly publishing value chain
arXiv – service pathways (decomposed and distributed)
Selected Grid vs. the vision
• SRB
– Distributed, virtualized file system
– Support for very large amounts of data
– Data grid compatible with computational grid
– Possible as backend persistent store for other repository systems (e.g., DSpace, Fedora)
• Chimera
– Derived data as first class information entities
– Information model (Virtual Data System)
– Process model (Virtual Data Language)
New Technical Architecture
The architecture challenge
• Current situation
– Heterogeneous repository systems
– Heterogeneous object models (or no object model)
– Multiple protocols and service APIs
– Services lacking formal interface definitions
• Can these resources ever play nicely together?
• Need common abstractions…
Solution: Information Network Overlay
Figure: three-layer information network overlay
Source Layer: Data Stores, Document Repositories, Databases, Web Resources, Publisher Repositories
Network Representation Layer
Client Layer (reached through the Information Network API)
Translate to Technical Requirements
• Rich information objects
– Integration of local and remote sources
– Mixed genre
• Dynamic information objects
– Integration with local and distributed services
• Graph-based information model to enable overlay
– Nodes are information objects
– Edges are relationships among those objects
• Service-oriented process model:
– Coordination of information entities and services
– Workflow; multi-step executions; transformations
– Interoperable access and management API for objects
• Fine-grained access control
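A minimal sketch of the graph-based overlay follows: adapters translate heterogeneous sources into a common node abstraction, and edges then relate nodes regardless of where they live. All class names, the `urn:example:` identifiers, and the `isSupportedBy` relation are assumptions made for illustration; they are not the project's actual model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Tuple


@dataclass(frozen=True)
class Node:
    uri: str       # identifier of the information object
    type_uri: str  # identifier of its type
    source: str    # which underlying system holds it


class SourceAdapter(Protocol):
    """Common abstraction every source must offer to join the overlay."""

    def nodes(self) -> Iterable[Node]: ...


class DocumentRepositoryAdapter:
    def __init__(self, name: str, records: List[dict]) -> None:
        self.name = name
        self.records = records

    def nodes(self) -> Iterable[Node]:
        for record in self.records:
            yield Node(record["id"], "urn:example:type/document", self.name)


class DataStoreAdapter:
    def __init__(self, name: str, dataset_ids: List[str]) -> None:
        self.name = name
        self.dataset_ids = dataset_ids

    def nodes(self) -> Iterable[Node]:
        for dataset_id in self.dataset_ids:
            yield Node(f"urn:example:dataset/{dataset_id}",
                       "urn:example:type/dataset", self.name)


class Overlay:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Tuple[str, str, str]] = []

    def load(self, adapter: SourceAdapter) -> None:
        for node in adapter.nodes():
            self.nodes[node.uri] = node

    def relate(self, subject: str, relation: str, obj: str) -> None:
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both ends of an edge must be known nodes")
        self.edges.append((subject, relation, obj))

    def neighbors(self, uri: str) -> List[Node]:
        return [self.nodes[o] for s, _, o in self.edges if s == uri]


overlay = Overlay()
overlay.load(DocumentRepositoryAdapter("repo-a", [{"id": "urn:example:doc/1"}]))
overlay.load(DataStoreAdapter("store-b", ["42"]))
overlay.relate("urn:example:doc/1", "isSupportedBy", "urn:example:dataset/42")
linked = overlay.neighbors("urn:example:doc/1")
```

The edge between the document and the dataset lives only in the overlay; neither source system needs to know about the other.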
Pathways Project
• National Science Foundation Funding 2004-2007 (http://www.infosci.cornell.edu/pathways)
• Van de Sompel, Payette, Erickson, Lagoze, Warner. Rethinking Scholarly Communication: Building the System that Scholars Deserve. D-Lib Magazine, September 2004.
Vision: “Graphite” Information Model
Figure: Graphite Overlay Fragment. Nodes URI-1 through URI-10, typed with TypeURI-1 through TypeURI-8. Other labels: arXiv, Fedora, LANL Repository, Image Object, Web resource, Grid dataset, Document, Service-B.
Cornell/LANL Pathways Project
Most things can be represented as a graph of nodes and arcs.
Service-oriented process model
• Key challenge is to integrate a distributed service model within the information network overlay.
• Technologies to watch
– OWL-S (W3C)
  • Ontology-based service descriptions
  • Service modeled within semantic web
– NetKernel (1060 Research)
  • Enables a graph-like overlay for URI-identified resources
  • Information entities and services can be accommodated
– Grid technologies (Open Grid Services Infrastructure)
  • Enables creation of ‘virtual organizations’ that can share distributed computational resources and services
  • Web services and WSDL in latest incarnation
The W3C’s Take on Things…
• People and communities have data stores and programs to share
• Vision: Expanding Web of machine accessible resources
• Key Web technologies:
• Web Services: Web of programs*
– Standards for interactions between programs on the Web
– Easier to expose and use services
• Semantic Web: Web of data*
– Standards for data, relationships, descriptions on the Web
– Easier to Search for, Share, Aggregate, Extend information
* abstractions :-)
Source: http://www.w3.org/2004/Talks/0923-sb-whoiw3c/slide12-0.html
Conclusions: Implications for digital repositories
Beyond Storage
Must understand new scholarly activities and new technical developments…
so we can frame repositories within a broader service-oriented architecture.
What basic changes can occur now?
• Expose repositories as web services
• Support compound digital objects
– Local and remote content
– Any media type
– Provide a way to associate services with objects (dynamic views)
• Provide ability to assert relationships among objects
• Move toward ontology-based metadata
• Enable easy integration of repository with other services
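A compound digital object with associated services can be sketched as a small data structure: components may be local or remote and of any media type, and each attached service yields a dynamic view computed on request. The class names, fields, locations, and the `toc` view are hypothetical, chosen only to illustrate the changes listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Component:
    label: str
    media_type: str
    location: str         # local path or remote URL
    remote: bool = False


@dataclass
class CompoundObject:
    pid: str
    components: List[Component] = field(default_factory=list)
    services: Dict[str, Callable[["CompoundObject"], str]] = field(default_factory=dict)

    def attach_service(self, name: str,
                       service: Callable[["CompoundObject"], str]) -> None:
        """Associate a service with this object under a view name."""
        self.services[name] = service

    def view(self, name: str) -> str:
        """Produce a dynamic view by running the named service on the object."""
        return self.services[name](self)


def table_of_contents(obj: CompoundObject) -> str:
    # Illustrative service: summarize the object's components.
    return ", ".join(f"{c.label} ({c.media_type})" for c in obj.components)


obj = CompoundObject("demo:1")
obj.components.append(Component("paper", "application/pdf", "/data/paper.pdf"))
obj.components.append(
    Component("dataset", "text/csv", "https://data.example.org/d.csv", remote=True)
)
obj.attach_service("toc", table_of_contents)
toc = obj.view("toc")
```

The view is not stored anywhere; it is recomputed from the object's current components each time it is requested.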
Example: Fedora Service Framework (2005-2007)
Figure: Fedora Service Framework
Fedora Repository Service
Services: Preservation Integrity Service; Basic Workflow Service; OAI Provider Service; Directory Ingest Service; Federation PID Resolution Service; Preservation Monitoring Service; Event Notification Service; Fedora Search Service; Dynamic Disseminator Service; Other Service
Apps: Fedora-Web-IR Administrator; Web-based submission and basic workflow; Policy Builder
External: External Workflow; JHOVE; GDFR
Research Challenges
• Enable low barrier to entry
– Simple protocols (e.g., like OAI)
– Light-weight (REST vs. SOAP?)
– Simple tools to create overlays
– Note complexity in setting up Grid-based services
• Integration of information and service models
• Security and Trust
– Authentication and trust among repositories and services
– Interoperability of authorization policy
• Preservation
– Distributed and dynamic resources
Beyond Storage
Questions and Discussion!